Customers with high retention potential are more likely to continue doing business with a company, purchase more products or services, and recommend the company to others. They tend to have a positive perception of the company, trust its brand, and feel satisfied with their overall experience.

On the other hand, customers with low retention potential are at risk of churning, i.e. ending their business relationship with the company. They may have had a negative experience with the company or its products, feel dissatisfied, or perceive a lack of value in their interactions with the company.

By identifying and analyzing the retention potential of its customers, a company can develop strategies to improve customer loyalty and satisfaction, increase customer lifetime value, and reduce churn rates.
Attributes present in our data:
1) CustomerID: a unique identifier for each customer.
2) City: the city a customer belongs to.
3) DateOfRegistration: the date when a customer registered.
4) OnlineCommunication: whether a customer prefers online communication or not.
5) AutomaticRefill: whether a customer wants their account refilled automatically when a term ends.
6) DoorstepDelivery: whether a customer prefers doorstep delivery or not.
7) PreferredDeliveryDay: the day of the week on which a customer prefers delivery.
8) RetentionPotential: a measure of the customer's loyalty and satisfaction with the company's products or services.
9) DateOfOrder: the date on which a customer placed an order with the company.
10) timestamp (transactions): the time of day of the order event, including hour, minute, and second.
11) OrderQuantity: the number of units of a product that a customer has ordered. This data point is recorded in the company's order management system and is used for tracking and managing customer orders.
12) DateOfemail: the date on which an email was sent to or received by the customer. This data point is recorded by the email service provider and is used for tracking and managing email communications.
13) EmailType: the category of an email message based on its content or purpose (here, WelcomeEmail or CustomizedEmail).
14) timestamp (emails): the time of day at which the email was sent.
15) MailOpened: whether the recipient opened the mail; commonly used in email marketing to measure campaign effectiveness and recipient behavior.
16) MailClicked: whether the recipient clicked a link in the mail; likewise used to measure campaign effectiveness and recipient behavior.
#Import the required Libraries.
import warnings
warnings.filterwarnings("ignore")

import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz
from IPython.display import Image

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score)

# chi-square test of independence
from scipy.stats import chi2, chi2_contingency
## conda install -c anaconda statsmodels
import statsmodels.api as sm
get_ipython().run_line_magic('matplotlib', 'inline')
pd.set_option('display.max_columns',125)
pd.set_option('display.max_rows',125)
customer=pd.read_excel("/content/Customerdata_Train.xlsx")
train=pd.read_excel("/content/Train.xlsx")
transaction=pd.read_excel("/content/transactionsdata_Train (1).xlsx")
email=pd.read_excel("/content/Emaildata_Train (1).xlsx")
transaction.columns
Index(['CustomerID', 'DateOfOrder', 'timestamp', 'OrderQuantity'], dtype='object')
#checking all the datasets shape to get a proper understanding
print("shape of customer",customer.shape)
print("shape of train",train.shape)
print("shape of transaction",transaction.shape)
print("shape of email",email.shape)
shape of customer (23820, 7)
shape of train (23820, 2)
shape of transaction (104755, 4)
shape of email (517672, 6)
From this we understand that our target column consists of 23820 rows, so we have to aggregate each dataset in such a way that we end up with 23820 unique values in its CustomerID column. We will therefore go through all four datasets and check what to filter.
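One way to check this expectation programmatically is to compare the unique-ID counts across the frames. A minimal sketch with toy stand-in frames (names and values are illustrative, not the real datasets):

```python
import pandas as pd

# Toy stand-ins for the four real frames (customer, train, transaction, email)
frames = {
    "customer": pd.DataFrame({"CustomerID": ["C1", "C2", "C3"]}),
    "transaction": pd.DataFrame({"CustomerID": ["C1", "C1", "C2", "C3"]}),
}

# After aggregation, every dataset should cover the same set of unique IDs
unique_counts = {name: df["CustomerID"].nunique() for name, df in frames.items()}
print(unique_counts)

# All frames agree on the number of unique customers
assert len(set(unique_counts.values())) == 1
```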
Let's start with the customer data.
#customer.head()
customer["CustomerID"].nunique()
#checking unique values for all the customer ids
23820
customer.head()
| CustomerID | City | DateOfRegistration | OnlineCommunication | AutomaticRefill | DoorstepDelivery | PreferredDeliveryDay | |
|---|---|---|---|---|---|---|---|
| 0 | C16368 | CITY2 | 2012-09-28 | 0 | 0 | 0 | Monday |
| 1 | C20345 | CITY2 | 2010-12-19 | 1 | 1 | 1 | Friday |
| 2 | C17683 | CITY2 | 2010-10-03 | 0 | 0 | 0 | Wednesday |
| 3 | C18647 | CITY4 | 2010-11-27 | 0 | 0 | 0 | Monday |
| 4 | C36692 | CITY2 | 2009-12-04 | 1 | 1 | 0 | Friday |
train
#train.head()
train["CustomerID"].nunique()
#checking unique values for all the customer ids
23820
transaction
#transaction['DateOfOrder'] = pd.to_datetime(transaction['DateOfOrder'])
transaction["CustomerID"].nunique()
#checking unique values for all the customer ids
23820
# combine the 'DateOfOrder' and 'timestamp' columns into a single datetime column
transaction['last_transaction_date'] = pd.to_datetime(transaction['DateOfOrder'] +' '+transaction['timestamp'])
monthly_orders = transaction.groupby(pd.Grouper(key='last_transaction_date', freq='M')).agg({'OrderQuantity': 'sum'})
# Create a subplot with 1 row and 2 columns
fig, axs = plt.subplots(1, 2, figsize=(14, 2))
# Plot monthly order quantities as a line chart
axs[0].plot(monthly_orders.index, monthly_orders['OrderQuantity'], color='blue')
axs[0].set_title('Monthly Order Quantities-lineplot')
axs[0].set_xlabel('Date')
axs[0].set_ylabel('Order Quantity')
# Plot yearly order quantities as a bar chart
axs[1].bar(monthly_orders.index.year, monthly_orders['OrderQuantity'], color='green')
axs[1].set_title('Monthly Order Quantities-barplot')
axs[1].set_xlabel('Year')
axs[1].set_ylabel('Order Quantity')
# Display the plot
plt.show()
**Key point: if we observe here, our monthly orders were highest during the 2013-2014 period.**
yearly_orders = transaction.groupby(pd.Grouper(key='last_transaction_date', freq='Y')).agg({'OrderQuantity': 'sum'})
# Create a subplot with 1 row and 2 columns
fig, axs = plt.subplots(1, 2, figsize=(14, 2))
# Plot yearly order quantities as a line chart
axs[0].plot(yearly_orders.index, yearly_orders['OrderQuantity'], color='blue')
axs[0].set_title('yearly Order Quantities-lineplot')
axs[0].set_xlabel('Date')
axs[0].set_ylabel('Order Quantity')
# Plot yearly order quantities as a bar chart
axs[1].bar(yearly_orders.index.year, yearly_orders['OrderQuantity'], color='green')
axs[1].set_title('yearly Order Quantities-barplot')
axs[1].set_xlabel('Year')
axs[1].set_ylabel('Order Quantity')
# Display the plot
plt.show()
If we observe here, yearly order quantities were highest during the 2011-2014 period.
Customer retention: By analyzing the average time between purchases, businesses can determine if their customers are coming back to make repeat purchases or if they are churning. If the average time between purchases is increasing over time, it may indicate that customers are losing interest in the business or finding better alternatives.
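The notebook computes this metric with a grouped `diff()`; the mechanics of that call can be seen on a toy frame (illustrative data):

```python
import pandas as pd

# Illustrative orders for two customers
df = pd.DataFrame({
    "CustomerID": ["A", "A", "A", "B", "B"],
    "DateOfOrder": pd.to_datetime(
        ["2013-01-01", "2013-01-04", "2013-01-10", "2013-02-01", "2013-02-15"]
    ),
})
df = df.sort_values(["CustomerID", "DateOfOrder"])

# diff() runs within each customer group; the first order of a customer has
# no predecessor, so its gap is NaN, which is filled with 0
df["gap_days"] = df.groupby("CustomerID")["DateOfOrder"].diff().dt.days.fillna(0)
print(df["gap_days"].tolist())  # [0.0, 3.0, 6.0, 0.0, 14.0]
```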
transaction['DateOfOrder']=pd.to_datetime(transaction['DateOfOrder'])
# Sort the dataset by customer ID and purchase date
df3 = transaction.sort_values(['CustomerID', 'DateOfOrder'])
# Calculate the time between purchases for each customer
df3['time_between_purchases'] = df3.groupby('CustomerID')['DateOfOrder'].diff().dt.days.fillna(0)
df3['order_month']=pd.to_datetime(df3['DateOfOrder']).dt.month
# group the data by customer ID and month, and count the number of orders in each group
order_counts = df3.groupby(['CustomerID', 'order_month'])['OrderQuantity'].sum()
# use the unstack method to pivot the table so that the rows represent customers and the columns represent months
order_counts = order_counts.unstack(level=-1, fill_value=0)
# fill any missing values with 0
order_counts = order_counts.fillna(0)
order_counts.columns=["Totalorders-jan","Totalorders-feb","Totalorders-march","Totalorders-april","Totalorders-may","Totalorders-june","Totalorders-july","Totalorders-august","Totalorders-september","Totalorders-october","Totalorders-november","Totalorders-december"]
order_counts=order_counts.reset_index(drop=False)
# Group the data by customer ID and calculate the number of transactions, total amount spent, and last transaction date
grouped_data = df3.groupby('CustomerID').agg({
'CustomerID': 'count', # Number of transactions
'OrderQuantity': ['sum','min','max'],# Total orders
'last_transaction_date': ['min', 'max'],# first transaction date,Last transaction date
'time_between_purchases':'mean' # average time between purchases
})
# Rename the columns to be more descriptive
grouped_data.columns = ['num_transactions','totalorders','minimumorderquantity','maximumorderquantity','first_transaction_date','last_transaction_date','average_time_between_purchase']
grouped_data['DateOfOrder'] = grouped_data['last_transaction_date'].dt.date.astype(str)
grouped_data['timestamp'] = grouped_data['last_transaction_date'].dt.time.apply(lambda x: x.strftime('%H:%M:%S'))
grouped_data["averageorders"]=grouped_data["totalorders"]//grouped_data["num_transactions"]
grouped_data = grouped_data.reset_index(drop=False)
# merge the grouped DataFrame back with the main DataFrame
transaction_data = pd.merge(df3, grouped_data, on=['CustomerID','last_transaction_date'],how="inner")
transaction_data = pd.merge(transaction_data, order_counts, on=['CustomerID'],how="inner")
transaction['last_transaction_date']
transaction = transaction.rename(columns={'last_transaction_date': 'transaction_date'})
#transaction_data
transaction_data.shape
#rechecking for incorrect dtypes
#transaction_data["CustomerID"].nunique()
#rechecking for unique customerids
(23820, 28)
transaction_data=transaction_data.drop(['DateOfOrder_x', 'timestamp_x','DateOfOrder_y', 'timestamp_y','time_between_purchases','order_month'],axis=1)
transaction_data.head()
| CustomerID | OrderQuantity | last_transaction_date | num_transactions | totalorders | minimumorderquantity | maximumorderquantity | first_transaction_date | average_time_between_purchase | averageorders | Totalorders-jan | Totalorders-feb | Totalorders-march | Totalorders-april | Totalorders-may | Totalorders-june | Totalorders-july | Totalorders-august | Totalorders-september | Totalorders-october | Totalorders-november | Totalorders-december | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C12116 | 33 | 2013-10-10 19:50:02 | 3 | 115 | 33 | 44 | 2013-10-03 13:49:08 | 2.333333 | 38 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 115 | 0 | 0 |
| 1 | C12117 | 41 | 2013-12-05 18:47:14 | 1 | 41 | 41 | 41 | 2013-12-05 18:47:14 | 0.000000 | 41 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 41 |
| 2 | C12118 | 41 | 2013-09-10 10:08:42 | 1 | 41 | 41 | 41 | 2013-09-10 10:08:42 | 0.000000 | 41 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 41 | 0 | 0 | 0 |
| 3 | C12119 | 52 | 2013-10-22 20:48:00 | 1 | 52 | 52 | 52 | 2013-10-22 20:48:00 | 0.000000 | 52 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 52 | 0 | 0 |
| 4 | C12120 | 31 | 2014-01-10 13:24:12 | 1 | 31 | 31 | 31 | 2014-01-10 13:24:12 | 0.000000 | 31 | 31 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Now for the email data:
email.head()
| CustomerID | DateOfemail | EmailType | timestamp | MailOpened | MailClicked | |
|---|---|---|---|---|---|---|
| 0 | C16368 | 2012-09-28 | WelcomeEmail | 15:56:50 | yes | NotApplicable |
| 1 | C20345 | 2010-12-19 | WelcomeEmail | 15:10:30 | yes | NotApplicable |
| 2 | C17683 | 2010-10-03 | WelcomeEmail | 17:12:24 | yes | NotApplicable |
| 3 | C18647 | 2010-11-27 | WelcomeEmail | 16:59:02 | yes | NotApplicable |
| 4 | C36692 | 2009-12-04 | WelcomeEmail | 17:27:22 | yes | NotApplicable |
email.columns
Index(['CustomerID', 'DateOfemail', 'EmailType', 'timestamp', 'MailOpened',
'MailClicked'],
dtype='object')
# combine the 'DateOfemail' and 'timestamp' columns into a single datetime column
email['emailDatetime'] = pd.to_datetime(email['DateOfemail'] +' '+email['timestamp'])
email["EmailType"].value_counts()
CustomizedEmail    493852
WelcomeEmail        23820
Name: EmailType, dtype: int64
email.columns
Index(['CustomerID', 'DateOfemail', 'EmailType', 'timestamp', 'MailOpened',
'MailClicked', 'emailDatetime'],
dtype='object')
Let's try to visualize the different email categories.
# let's visualize month-wise
monthly_mails = email.groupby(pd.Grouper(key='emailDatetime', freq='M')).agg({'emailDatetime': 'count'})
# Create a subplot with 1 row and 2 columns
fig, axs = plt.subplots(1, 2, figsize=(14, 2))
# Plot monthly mail counts as a line chart
axs[0].plot(monthly_mails.index, monthly_mails['emailDatetime'], color='blue')
axs[0].set_title('Monthly mails-lineplot')
axs[0].set_xlabel('Date')
axs[0].set_ylabel('Mails')
# Plot monthly mail counts as a bar chart
axs[1].bar(monthly_mails.index.year, monthly_mails['emailDatetime'], color='green')
axs[1].set_title('Monthly mails-barplot')
axs[1].set_xlabel('Year')
axs[1].set_ylabel('Mails')
# Display the plot
plt.show()
If we observe the monthly mails, they were highest during the 2013-2014 period.
#lets visualize yearwise
yearly_mails = email.groupby(pd.Grouper(key='emailDatetime', freq='Y')).agg({'emailDatetime': 'count'})
# Create a subplot with 1 row and 2 columns
fig, axs = plt.subplots(1, 2, figsize=(14, 2))
# Plot yearly mail counts as a line chart
axs[0].plot(yearly_mails.index, yearly_mails['emailDatetime'], color='blue')
axs[0].set_title('yearly mails-lineplot')
axs[0].set_xlabel('Date')
axs[0].set_ylabel('Mails')
# Plot yearly mail counts as a bar chart
axs[1].bar(yearly_mails.index.year, yearly_mails['emailDatetime'], color='green')
axs[1].set_title('yearly mails-barplot')
axs[1].set_xlabel('Year')
axs[1].set_ylabel('Mails')
# Display the plot
plt.show()
If we observe carefully, monthly order quantities and monthly mails were both highest during the 2013-2014 period, which suggests that the number of mails sent is directly proportional to order quantities.
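This proportionality claim can be tested numerically by correlating the monthly mail counts with the monthly order totals. A sketch on synthetic monthly series (values are illustrative stand-ins for `monthly_mails` and `monthly_orders`):

```python
import pandas as pd

# Synthetic month-end series standing in for monthly_orders and monthly_mails
idx = pd.to_datetime(["2013-01-31", "2013-02-28", "2013-03-31",
                      "2013-04-30", "2013-05-31", "2013-06-30"])
orders = pd.Series([100, 150, 200, 260, 300, 340], index=idx)
mails = pd.Series([40, 55, 80, 100, 120, 135], index=idx)

# A Pearson correlation near 1 supports "mails sent track order quantities"
corr = orders.corr(mails)
print(round(corr, 3))
assert corr > 0.9
```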
# group the data by customer ID and mail-opened status, and count how many mails each customer opened or did not open
mail_open_counts = email.groupby(['CustomerID', 'MailOpened'])['emailDatetime'].count()
# use the unstack method to pivot the table so that the rows represent customers and the columns represent mail-opened status
mail_open_counts = mail_open_counts.unstack(level=-1, fill_value=0)
# fill any missing values with 0
mail_open_counts = mail_open_counts.fillna(0)
mail_open_counts.columns=["Totalmails-notoponed","Totalmails-oponed"]
mail_open_counts=mail_open_counts.reset_index(drop=False)
We have calculated, for each customer, the total number of mails opened and the total number of mails not opened out of all mails received.
#mail_open_counts
# group the data by customer ID and mail-clicked status, and count how many mails each customer clicked or did not click
MailClicked_counts = email.groupby(['CustomerID', 'MailClicked'])['emailDatetime'].count()
# use the unstack method to pivot the table so that the rows represent customers and the columns represent mail-clicked status
MailClicked_counts = MailClicked_counts.unstack(level=-1, fill_value=0)
# fill any missing values with 0
MailClicked_counts = MailClicked_counts.fillna(0)
MailClicked_counts.columns=["Notapplicableclickedmails","timesclicked-yes","timesclicked-no"]
MailClicked_counts=MailClicked_counts.reset_index(drop=False)
Total number of times a customer clicked a mail and total number of times a customer did not click, out of all the mails received.
MailClicked_counts
| CustomerID | Notapplicableclickedmails | timesclicked-yes | timesclicked-no | |
|---|---|---|---|---|
| 0 | C12116 | 1 | 29 | 2 |
| 1 | C12117 | 1 | 0 | 0 |
| 2 | C12118 | 1 | 0 | 0 |
| 3 | C12119 | 1 | 37 | 4 |
| 4 | C12120 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 23815 | C42185 | 1 | 43 | 0 |
| 23816 | C42186 | 1 | 0 | 0 |
| 23817 | C42187 | 1 | 0 | 0 |
| 23818 | C42188 | 1 | 0 | 0 |
| 23819 | C42189 | 1 | 50 | 0 |
23820 rows × 4 columns
There are two possibilities here: a customer received only a welcome mail, or a customer also received customized mails.
# filtering customers who have received a welcome mail but not a customized mail
welcome_mail = email[(email['EmailType'] == 'WelcomeEmail') & ~(email['CustomerID'].isin(email[email['EmailType'] == 'CustomizedEmail']['CustomerID'].unique()))].copy()
welcome_mail["CustomerID"].nunique()
9709
welcome_mail.head()
| CustomerID | DateOfemail | EmailType | timestamp | MailOpened | MailClicked | emailDatetime | |
|---|---|---|---|---|---|---|---|
| 2 | C17683 | 2010-10-03 | WelcomeEmail | 17:12:24 | yes | NotApplicable | 2010-10-03 17:12:24 |
| 15 | C29837 | 2011-02-22 | WelcomeEmail | 16:30:28 | yes | NotApplicable | 2011-02-22 16:30:28 |
| 18 | C13691 | 2011-03-10 | WelcomeEmail | 12:38:08 | yes | NotApplicable | 2011-03-10 12:38:08 |
| 25 | C14905 | 2009-04-24 | WelcomeEmail | 10:38:10 | yes | NotApplicable | 2009-04-24 10:38:10 |
| 30 | C22750 | 2011-01-04 | WelcomeEmail | 12:26:48 | yes | NotApplicable | 2011-01-04 12:26:48 |
welcome_mail.shape
(9709, 7)
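The filter above keeps only customers whose IDs never occur alongside a CustomizedEmail. The same `~isin` pattern can be seen on a toy email log (illustrative data):

```python
import pandas as pd

# Toy email log: C1 got both email types, C2 only a welcome email
log = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "EmailType": ["WelcomeEmail", "CustomizedEmail", "WelcomeEmail"],
})

# IDs that ever received a customized email
customized_ids = log.loc[log["EmailType"] == "CustomizedEmail", "CustomerID"].unique()

# Welcome rows whose customer never appears in customized_ids
welcome_only = log[(log["EmailType"] == "WelcomeEmail")
                   & ~log["CustomerID"].isin(customized_ids)]
print(welcome_only["CustomerID"].tolist())  # ['C2']
```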
Analyzing the average time between emails can provide businesses with insights into customer engagement and help them identify opportunities to improve customer retention. (Note that the welcome-only customers each received exactly one email, so their average gap is trivially 0.)
welcome_mail['DateOfemail']=pd.to_datetime(welcome_mail['DateOfemail'])
# Sort the dataset by customer ID and purchase date
welcome_mail = welcome_mail.sort_values(['CustomerID', 'DateOfemail'])
# Calculate the time between purchases for each customer
welcome_mail['time_between_emails'] = welcome_mail.groupby('CustomerID')['DateOfemail'].diff().dt.days.fillna(0)
# Group the data by customer ID and calculate the number of emails sent to customer,first email sent, and last email date
grouped_data3 = welcome_mail.groupby('CustomerID').agg({
'CustomerID': 'count', # Number of emails
'emailDatetime': ['max','min'], # Last email date,first email date
'time_between_emails':'mean'
})
# Rename the columns to be more descriptive
grouped_data3.columns = ['numofemails','lastemailDatetime','firstemailDatetime','average_time_between_mails']
grouped_data3 = grouped_data3.reset_index(drop=False)
grouped_data3['DateOfemail'] = grouped_data3['lastemailDatetime'].dt.date.astype(str)
grouped_data3['timestamp'] = grouped_data3['lastemailDatetime'].dt.time.apply(lambda x: x.strftime('%H:%M:%S'))
welcome_mail['DateOfemail']=welcome_mail['DateOfemail'].dt.date.astype(str)
# merge the grouped DataFrame back with the main DataFrame
merged_data3 = pd.merge(welcome_mail, grouped_data3, on=['CustomerID','DateOfemail','timestamp'],how="inner")
#merged_data.shape
#merged_data["CustomerID"].nunique()
#merged_data.columns
merged_data3=merged_data3.drop(['DateOfemail','timestamp'],axis=1)
# filtering customers who received a customized mail
customizedmail=email[email['EmailType'] == 'CustomizedEmail'].copy()
customizedmail.head()
| CustomerID | DateOfemail | EmailType | timestamp | MailOpened | MailClicked | emailDatetime | |
|---|---|---|---|---|---|---|---|
| 23820 | C16368 | 2013-05-20 | CustomizedEmail | 15:10:30 | yes | no | 2013-05-20 15:10:30 |
| 23821 | C16368 | 2013-02-05 | CustomizedEmail | 13:54:22 | yes | no | 2013-02-05 13:54:22 |
| 23822 | C16368 | 2012-11-26 | CustomizedEmail | 11:22:32 | yes | no | 2012-11-26 11:22:32 |
| 23823 | C16368 | 2013-05-30 | CustomizedEmail | 14:04:42 | yes | no | 2013-05-30 14:04:42 |
| 23824 | C16368 | 2013-05-18 | CustomizedEmail | 10:18:16 | yes | no | 2013-05-18 10:18:16 |
customizedmail['DateOfemail']=pd.to_datetime(customizedmail['DateOfemail'])
# Sort the dataset by customer ID and purchase date
customizedmail = customizedmail.sort_values(['CustomerID', 'DateOfemail'])
# Calculate the time between purchases for each customer
customizedmail['time_between_emails'] = customizedmail.groupby('CustomerID')['DateOfemail'].diff().dt.days.fillna(0)
# Group the data by customer ID and calculate the number of emails sent to customer,first email sent, and last email date
grouped_data2 = customizedmail.groupby('CustomerID').agg({
'CustomerID': 'count', # Number of emails
'emailDatetime': ['max','min'], # Last email date,first email date
'time_between_emails':'mean'
})
grouped_data2
| CustomerID | emailDatetime | time_between_emails | ||
|---|---|---|---|---|
| count | max | min | mean | |
| CustomerID | ||||
| C12116 | 31 | 2013-10-09 11:49:08 | 2013-09-24 11:48:56 | 0.483871 |
| C12119 | 41 | 2013-10-21 10:24:24 | 2013-04-22 17:13:42 | 4.439024 |
| C12122 | 45 | 2013-04-09 10:21:20 | 2012-11-13 10:29:56 | 3.266667 |
| C12123 | 28 | 2014-01-10 13:59:24 | 2013-11-16 16:01:48 | 1.964286 |
| C12124 | 32 | 2013-11-06 17:45:54 | 2013-10-31 13:22:32 | 0.187500 |
| ... | ... | ... | ... | ... |
| C42181 | 47 | 2013-10-29 11:53:36 | 2011-08-14 10:06:08 | 17.170213 |
| C42182 | 49 | 2013-11-28 17:51:08 | 2012-08-31 10:32:00 | 9.265306 |
| C42183 | 41 | 2017-08-13 17:43:26 | 2017-08-07 11:18:38 | 0.146341 |
| C42185 | 43 | 2011-12-30 13:52:52 | 2011-03-05 11:01:04 | 6.976744 |
| C42189 | 50 | 2013-11-05 16:32:30 | 2013-01-20 16:07:58 | 5.780000 |
14111 rows × 4 columns
# Rename the columns to be more descriptive
grouped_data2.columns = ['numofemails','lastemailDatetime','firstemailDatetime','average_time_between_mails']
grouped_data2 = grouped_data2.reset_index(drop=False)
grouped_data2.head()
| CustomerID | numofemails | lastemailDatetime | firstemailDatetime | average_time_between_mails | |
|---|---|---|---|---|---|
| 0 | C12116 | 31 | 2013-10-09 11:49:08 | 2013-09-24 11:48:56 | 0.483871 |
| 1 | C12119 | 41 | 2013-10-21 10:24:24 | 2013-04-22 17:13:42 | 4.439024 |
| 2 | C12122 | 45 | 2013-04-09 10:21:20 | 2012-11-13 10:29:56 | 3.266667 |
| 3 | C12123 | 28 | 2014-01-10 13:59:24 | 2013-11-16 16:01:48 | 1.964286 |
| 4 | C12124 | 32 | 2013-11-06 17:45:54 | 2013-10-31 13:22:32 | 0.187500 |
grouped_data2['DateOfemail'] = grouped_data2['lastemailDatetime'].dt.date.astype(str)
grouped_data2['timestamp'] = grouped_data2['lastemailDatetime'].dt.time.apply(lambda x: x.strftime('%H:%M:%S'))
customizedmail['DateOfemail']=customizedmail['DateOfemail'].dt.date.astype(str)
# merge the grouped DataFrame back with the main DataFrame
merged_data = pd.merge(customizedmail, grouped_data2, on=['CustomerID','DateOfemail','timestamp'],how="inner")
merged_data
| CustomerID | DateOfemail | EmailType | timestamp | MailOpened | MailClicked | emailDatetime | time_between_emails | numofemails | lastemailDatetime | firstemailDatetime | average_time_between_mails | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C12116 | 2013-10-09 | CustomizedEmail | 11:49:08 | no | no | 2013-10-09 11:49:08 | 0.0 | 31 | 2013-10-09 11:49:08 | 2013-09-24 11:48:56 | 0.483871 |
| 1 | C12119 | 2013-10-21 | CustomizedEmail | 10:24:24 | no | no | 2013-10-21 10:24:24 | 9.0 | 41 | 2013-10-21 10:24:24 | 2013-04-22 17:13:42 | 4.439024 |
| 2 | C12122 | 2013-04-09 | CustomizedEmail | 10:21:20 | no | no | 2013-04-09 10:21:20 | 6.0 | 45 | 2013-04-09 10:21:20 | 2012-11-13 10:29:56 | 3.266667 |
| 3 | C12123 | 2014-01-10 | CustomizedEmail | 13:59:24 | no | no | 2014-01-10 13:59:24 | 1.0 | 28 | 2014-01-10 13:59:24 | 2013-11-16 16:01:48 | 1.964286 |
| 4 | C12124 | 2013-11-06 | CustomizedEmail | 17:45:54 | no | no | 2013-11-06 17:45:54 | 0.0 | 32 | 2013-11-06 17:45:54 | 2013-10-31 13:22:32 | 0.187500 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14106 | C42181 | 2013-10-29 | CustomizedEmail | 11:53:36 | yes | no | 2013-10-29 11:53:36 | 53.0 | 47 | 2013-10-29 11:53:36 | 2011-08-14 10:06:08 | 17.170213 |
| 14107 | C42182 | 2013-11-28 | CustomizedEmail | 17:51:08 | no | no | 2013-11-28 17:51:08 | 2.0 | 49 | 2013-11-28 17:51:08 | 2012-08-31 10:32:00 | 9.265306 |
| 14108 | C42183 | 2017-08-13 | CustomizedEmail | 17:43:26 | no | no | 2017-08-13 17:43:26 | 0.0 | 41 | 2017-08-13 17:43:26 | 2017-08-07 11:18:38 | 0.146341 |
| 14109 | C42185 | 2011-12-30 | CustomizedEmail | 13:52:52 | no | no | 2011-12-30 13:52:52 | 2.0 | 43 | 2011-12-30 13:52:52 | 2011-03-05 11:01:04 | 6.976744 |
| 14110 | C42189 | 2013-11-05 | CustomizedEmail | 16:32:30 | yes | no | 2013-11-05 16:32:30 | 1.0 | 50 | 2013-11-05 16:32:30 | 2013-01-20 16:07:58 | 5.780000 |
14111 rows × 12 columns
#merged_data.shape
#merged_data["CustomerID"].nunique()
#merged_data.columns
merged_data=merged_data.drop(['DateOfemail','timestamp'],axis=1)
merged_data.shape
(14111, 10)
merged_data3.shape
(9709, 10)
mail_data= pd.concat([merged_data3,merged_data])
mail_data.head()
| CustomerID | EmailType | MailOpened | MailClicked | emailDatetime | time_between_emails | numofemails | lastemailDatetime | firstemailDatetime | average_time_between_mails | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C12117 | WelcomeEmail | yes | NotApplicable | 2013-12-05 15:59:30 | 0.0 | 1 | 2013-12-05 15:59:30 | 2013-12-05 15:59:30 | 0.0 |
| 1 | C12118 | WelcomeEmail | yes | NotApplicable | 2013-09-10 14:49:14 | 0.0 | 1 | 2013-09-10 14:49:14 | 2013-09-10 14:49:14 | 0.0 |
| 2 | C12120 | WelcomeEmail | yes | NotApplicable | 2014-01-10 11:40:10 | 0.0 | 1 | 2014-01-10 11:40:10 | 2014-01-10 11:40:10 | 0.0 |
| 3 | C12126 | WelcomeEmail | yes | NotApplicable | 2013-09-13 12:56:02 | 0.0 | 1 | 2013-09-13 12:56:02 | 2013-09-13 12:56:02 | 0.0 |
| 4 | C12137 | WelcomeEmail | yes | NotApplicable | 2017-10-05 13:53:38 | 0.0 | 1 | 2017-10-05 13:53:38 | 2017-10-05 13:53:38 | 0.0 |
mail_data=mail_data.drop(["time_between_emails"],axis=1)
Now we have to merge all the files into a single dataframe.
train=customer.merge(train, on="CustomerID").merge(mail_data, on="CustomerID").merge(transaction_data, on="CustomerID").merge(mail_open_counts, on="CustomerID").merge(MailClicked_counts, on="CustomerID")
train.shape
(23820, 42)
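A chain of inner merges like this can silently drop or duplicate rows if a key is missing or repeated in one of the frames; pandas' `validate` argument to `merge` guards against that. A minimal sketch with toy frames (illustrative data):

```python
import pandas as pd

left = pd.DataFrame({"CustomerID": ["C1", "C2"], "x": [1, 2]})
right = pd.DataFrame({"CustomerID": ["C1", "C2"], "y": [10, 20]})

# validate="one_to_one" raises MergeError if either side repeats a key,
# preventing a silent row explosion in a long merge chain
merged = left.merge(right, on="CustomerID", how="inner", validate="one_to_one")
print(merged.shape)  # (2, 3)
```

Since every intermediate frame here was aggregated down to one row per CustomerID, each merge in the chain should be one-to-one, which is why the final shape keeps all 23820 rows.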
Preprocessing steps:
1. Explaining the problem statement
2. Data structure and content
3. Exploratory data analysis
4. Visual analysis of data
5. Extracting insights
1) PROBLEM STATEMENT EXPLANATION: Predicting the retail customer retention potential level as high/medium/low, along with customer segmentation and pattern extraction. Retail customer retention potential refers to the likelihood of customers returning to make additional purchases from a retail business. This potential can be categorized into three levels: high, medium, and low. High-potential customers are likely to make repeat purchases and remain loyal to the brand. Medium-potential customers may make additional purchases but are less likely to remain loyal. Low-potential customers are unlikely to make additional purchases or remain loyal to the brand.
2)DATA STRUCTURE AND CONTENT
# To view the first five rows of the merged data
train.head()
| CustomerID | City | DateOfRegistration | OnlineCommunication | AutomaticRefill | DoorstepDelivery | PreferredDeliveryDay | RetentionPotential | EmailType | MailOpened | MailClicked | emailDatetime | numofemails | lastemailDatetime | firstemailDatetime | average_time_between_mails | OrderQuantity | last_transaction_date | num_transactions | totalorders | minimumorderquantity | maximumorderquantity | first_transaction_date | average_time_between_purchase | averageorders | Totalorders-jan | Totalorders-feb | Totalorders-march | Totalorders-april | Totalorders-may | Totalorders-june | Totalorders-july | Totalorders-august | Totalorders-september | Totalorders-october | Totalorders-november | Totalorders-december | Totalmails-notoponed | Totalmails-oponed | Notapplicableclickedmails | timesclicked-yes | timesclicked-no | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C16368 | CITY2 | 2012-09-28 | 0 | 0 | 0 | Monday | High | CustomizedEmail | yes | no | 2013-07-25 12:46:04 | 29 | 2013-07-25 12:46:04 | 2012-10-07 17:29:40 | 10.034483 | 15 | 2013-08-11 15:17:52 | 1 | 15 | 15 | 15 | 2013-08-11 15:17:52 | 0.000000 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0 | 0 | 0 | 0 | 30 | 1 | 28 | 1 |
| 1 | C20345 | CITY2 | 2010-12-19 | 1 | 1 | 1 | Friday | Low | CustomizedEmail | yes | no | 2013-12-19 14:28:06 | 95 | 2013-12-19 14:28:06 | 2010-12-21 10:43:32 | 11.515789 | 84 | 2014-01-19 18:11:30 | 186 | 15614 | 61 | 108 | 2011-04-01 19:50:00 | 5.505376 | 83 | 671 | 1145 | 696 | 1303 | 1516 | 1523 | 1679 | 1675 | 1314 | 1140 | 1321 | 1631 | 7 | 89 | 1 | 85 | 10 |
| 2 | C17683 | CITY2 | 2010-10-03 | 0 | 0 | 0 | Wednesday | Medium | WelcomeEmail | yes | NotApplicable | 2010-10-03 17:12:24 | 1 | 2010-10-03 17:12:24 | 2010-10-03 17:12:24 | 0.000000 | 34 | 2011-07-06 18:12:10 | 13 | 443 | 27 | 51 | 2010-12-01 11:38:20 | 16.692308 | 34 | 68 | 0 | 64 | 70 | 64 | 80 | 34 | 0 | 0 | 0 | 0 | 63 | 0 | 1 | 1 | 0 | 0 |
| 3 | C18647 | CITY4 | 2010-11-27 | 0 | 0 | 0 | Monday | Low | CustomizedEmail | yes | no | 2012-12-17 11:00:04 | 30 | 2012-12-17 11:00:04 | 2011-02-10 16:19:20 | 22.533333 | 111 | 2013-01-28 17:04:46 | 7 | 748 | 92 | 117 | 2010-11-29 15:23:52 | 113.000000 | 106 | 111 | 114 | 0 | 0 | 216 | 0 | 0 | 105 | 0 | 0 | 110 | 92 | 3 | 28 | 1 | 26 | 4 |
| 4 | C36692 | CITY2 | 2009-12-04 | 1 | 1 | 0 | Friday | Low | CustomizedEmail | no | no | 2014-01-16 13:34:52 | 64 | 2014-01-16 13:34:52 | 2009-12-24 10:59:42 | 23.187500 | 69 | 2014-01-16 11:21:46 | 34 | 2350 | 55 | 83 | 2011-09-21 19:54:50 | 24.941176 | 69 | 69 | 148 | 211 | 279 | 131 | 249 | 137 | 214 | 294 | 131 | 291 | 196 | 46 | 19 | 1 | 54 | 10 |
#to know how many rows and columns are present
train.shape
(23820, 42)
#understanding datatypes of each column
train.dtypes
CustomerID                               object
City                                     object
DateOfRegistration               datetime64[ns]
OnlineCommunication                       int64
AutomaticRefill                           int64
DoorstepDelivery                          int64
PreferredDeliveryDay                     object
RetentionPotential                       object
EmailType                                object
MailOpened                               object
MailClicked                              object
emailDatetime                    datetime64[ns]
numofemails                               int64
lastemailDatetime                datetime64[ns]
firstemailDatetime               datetime64[ns]
average_time_between_mails              float64
OrderQuantity                             int64
last_transaction_date            datetime64[ns]
num_transactions                          int64
totalorders                               int64
minimumorderquantity                      int64
maximumorderquantity                      int64
first_transaction_date           datetime64[ns]
average_time_between_purchase           float64
averageorders                             int64
Totalorders-jan                           int64
Totalorders-feb                           int64
Totalorders-march                         int64
Totalorders-april                         int64
Totalorders-may                           int64
Totalorders-june                          int64
Totalorders-july                          int64
Totalorders-august                        int64
Totalorders-september                     int64
Totalorders-october                       int64
Totalorders-november                      int64
Totalorders-december                      int64
Totalmails-notoponed                      int64
Totalmails-oponed                         int64
Notapplicableclickedmails                 int64
timesclicked-yes                          int64
timesclicked-no                           int64
dtype: object
# checking the number of unique values in each column so that we can spot any incorrectly typed columns
for i in train.columns:
print("no of unique values and dtype:","|",i,"|",train[i].nunique(),"|",train[i].dtype)
no of unique values and dtype: | CustomerID | 23820 | object
no of unique values and dtype: | City | 4 | object
no of unique values and dtype: | DateOfRegistration | 2673 | datetime64[ns]
no of unique values and dtype: | OnlineCommunication | 2 | int64
no of unique values and dtype: | AutomaticRefill | 2 | int64
no of unique values and dtype: | DoorstepDelivery | 2 | int64
no of unique values and dtype: | PreferredDeliveryDay | 7 | object
no of unique values and dtype: | RetentionPotential | 3 | object
no of unique values and dtype: | EmailType | 2 | object
no of unique values and dtype: | MailOpened | 2 | object
no of unique values and dtype: | MailClicked | 3 | object
no of unique values and dtype: | emailDatetime | 23771 | datetime64[ns]
no of unique values and dtype: | numofemails | 83 | int64
no of unique values and dtype: | lastemailDatetime | 23771 | datetime64[ns]
no of unique values and dtype: | firstemailDatetime | 23789 | datetime64[ns]
no of unique values and dtype: | average_time_between_mails | 5669 | float64
no of unique values and dtype: | OrderQuantity | 310 | int64
no of unique values and dtype: | last_transaction_date | 23792 | datetime64[ns]
no of unique values and dtype: | num_transactions | 150 | int64
no of unique values and dtype: | totalorders | 2206 | int64
no of unique values and dtype: | minimumorderquantity | 304 | int64
no of unique values and dtype: | maximumorderquantity | 325 | int64
no of unique values and dtype: | first_transaction_date | 23798 | datetime64[ns]
no of unique values and dtype: | average_time_between_purchase | 3069 | float64
no of unique values and dtype: | averageorders | 307 | int64
no of unique values and dtype: | Totalorders-jan | 693 | int64
no of unique values and dtype: | Totalorders-feb | 595 | int64
no of unique values and dtype: | Totalorders-march | 621 | int64
no of unique values and dtype: | Totalorders-april | 628 | int64
no of unique values and dtype: | Totalorders-may | 678 | int64
no of unique values and dtype: | Totalorders-june | 659 | int64
no of unique values and dtype: | Totalorders-july | 693 | int64
no of unique values and dtype: | Totalorders-august | 668 | int64
no of unique values and dtype: | Totalorders-september | 697 | int64
no of unique values and dtype: | Totalorders-october | 728 | int64
no of unique values and dtype: | Totalorders-november | 726 | int64
no of unique values and dtype: | Totalorders-december | 749 | int64
no of unique values and dtype: | Totalmails-notoponed | 70 | int64
no of unique values and dtype: | Totalmails-oponed | 61 | int64
no of unique values and dtype: | Notapplicableclickedmails | 1 | int64
no of unique values and dtype: | timesclicked-yes | 80 | int64
no of unique values and dtype: | timesclicked-no | 30 | int64
fig, ax = plt.subplots(figsize=(10, 3))
# Iterate over the columns
for col in train.columns:
    # Number of unique values in the column
    num_unique = train[col].nunique()
    # Plot a bar for each column
    ax.bar(col, num_unique)
# Customize the chart
ax.set_xlabel('Column Name')
ax.set_ylabel('Number of Unique Values')
ax.set_title('Unique Values in Each Column')
plt.xticks(rotation=90)
# Show the chart
plt.show()
If we observe the output above, OnlineCommunication, AutomaticRefill and DoorstepDelivery are stored with incorrect dtypes (int64 for what are really yes/no flags), so we have to convert these datatypes.
#converting dtypes of the following columns
train["OnlineCommunication"]=train["OnlineCommunication"].astype("object")
train["DoorstepDelivery"]=train["DoorstepDelivery"].astype("object")
train["AutomaticRefill"]=train["AutomaticRefill"].astype("object")
# Basic Info of the Data Frame
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 23820 entries, 0 to 23819
Data columns (total 42 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   CustomerID                     23820 non-null  object
 1   City                           23820 non-null  object
 2   DateOfRegistration             23820 non-null  datetime64[ns]
 3   OnlineCommunication            23820 non-null  object
 4   AutomaticRefill                23820 non-null  object
 5   DoorstepDelivery               23820 non-null  object
 6   PreferredDeliveryDay           23820 non-null  object
 7   RetentionPotential             23820 non-null  object
 8   EmailType                      23820 non-null  object
 9   MailOpened                     23820 non-null  object
 10  MailClicked                    23820 non-null  object
 11  emailDatetime                  23820 non-null  datetime64[ns]
 12  numofemails                    23820 non-null  int64
 13  lastemailDatetime              23820 non-null  datetime64[ns]
 14  firstemailDatetime             23820 non-null  datetime64[ns]
 15  average_time_between_mails     23820 non-null  float64
 16  OrderQuantity                  23820 non-null  int64
 17  last_transaction_date          23820 non-null  datetime64[ns]
 18  num_transactions               23820 non-null  int64
 19  totalorders                    23820 non-null  int64
 20  minimumorderquantity           23820 non-null  int64
 21  maximumorderquantity           23820 non-null  int64
 22  first_transaction_date         23820 non-null  datetime64[ns]
 23  average_time_between_purchase  23820 non-null  float64
 24  averageorders                  23820 non-null  int64
 25  Totalorders-jan                23820 non-null  int64
 26  Totalorders-feb                23820 non-null  int64
 27  Totalorders-march              23820 non-null  int64
 28  Totalorders-april              23820 non-null  int64
 29  Totalorders-may                23820 non-null  int64
 30  Totalorders-june               23820 non-null  int64
 31  Totalorders-july               23820 non-null  int64
 32  Totalorders-august             23820 non-null  int64
 33  Totalorders-september          23820 non-null  int64
 34  Totalorders-october            23820 non-null  int64
 35  Totalorders-november           23820 non-null  int64
 36  Totalorders-december           23820 non-null  int64
 37  Totalmails-notoponed           23820 non-null  int64
 38  Totalmails-oponed              23820 non-null  int64
 39  Notapplicableclickedmails      23820 non-null  int64
 40  timesclicked-yes               23820 non-null  int64
 41  timesclicked-no                23820 non-null  int64
dtypes: datetime64[ns](6), float64(2), int64(24), object(10)
memory usage: 7.8+ MB
If we observe here, we don't have a problem of null values, so no imputation is required as of now.
We can derive new columns from existing ones that might be useful for our analysis:
1)Difference between last_transaction_date and first_transaction_date, to know how long a customer has been purchasing on the platform.
2)Difference between the first and last email, giving the email_duration: the time period between the first and last email of a particular campaign or sequence. If the email_duration is too short, it may suggest we are not giving the audience enough time to engage with the content; conversely, if it is too long, it may indicate the emails are not engaging enough or not providing enough value to subscribers.
3)Difference between the date of registration and the last transaction date, which tells us how old or new a customer is. By subtracting the registration date from the last transaction date, we get the total duration of the customer's relationship with the company.
train["days_since_first_transaction"]=(train['last_transaction_date'] - train['first_transaction_date']).dt.days
train["email_duration"]=(train['lastemailDatetime']- train['firstemailDatetime']).dt.days
train["duration_of_customer_relation"]=(train['last_transaction_date'] - train['DateOfRegistration']).dt.days
4)Recency: the time elapsed since the customer's last transaction. The resulting values measure how long it has been since each customer last purchased. Reasons why this metric is important:
Identify at-risk customers: if a customer has not made a purchase in a long time, they may be at risk of churn; tracking the time since their last transaction lets us identify them and take steps to retain them.
Measure customer engagement: customers who purchase frequently are generally more engaged, so this metric gives a sense of engagement and reveals opportunities to increase it.
Develop targeted marketing campaigns: customers who have not purchased in a while may respond better to targeted campaigns; segmenting by time since last transaction lets us tailor campaigns to their needs.
Overall, recency is a valuable metric for understanding customer behavior and developing effective marketing and retention strategies.
train["recency"]=(train["last_transaction_date"].max()-train["last_transaction_date"]).dt.days
We will drop a few columns such as first_transaction_date and firstemailDatetime, since we kept them only to compute the differences against the latest dates; now we will drop these columns.
#train=train.drop(["first_transaction_date","firstemailDatetime","emailDatetime"],axis=1)
1)If the first transaction is also the last transaction, the difference comes out as 0, which is irrelevant in the context of our data. In such cases we should either not consider the rows with zeroes, or bin the column into categories.
2)Likewise, if the first email is also the last email, the email_duration comes out as 0 and is equally uninformative.
So we will bin these two columns.
bins = pd.cut(train["days_since_first_transaction"], [-1, 1, 10, 100, train["days_since_first_transaction"].max()], labels=["Only transaction", "Recent Transactions", "Moderate History", "Long-term History"])
train["days_since_first_transaction_category"] = bins
train["days_since_first_transaction_category"].value_counts()
Only transaction       15013
Long-term History       4111
Moderate History        3989
Recent Transactions      707
Name: days_since_first_transaction_category, dtype: int64
bins = pd.cut(train["email_duration"], [-1, 1, 10, 100, train["email_duration"].max()], labels=["Only email", "Medium history", "long mail history", "very Long mail History"])
train["email_duration_category"] = bins
train["email_duration_category"].value_counts()
Only email                10954
long mail history          6026
very Long mail History     4960
Medium history             1880
Name: email_duration_category, dtype: int64
train["email_duration_category"]=train["email_duration_category"].astype("object")
train["days_since_first_transaction_category"]=train["days_since_first_transaction_category"].astype("object")
# as we have already binned these columns, we will drop them
#train=train.drop(["days_since_first_transaction","email_duration"],axis=1)
# understanding statistical info of our data
train.describe()
| numofemails | average_time_between_mails | OrderQuantity | num_transactions | totalorders | minimumorderquantity | maximumorderquantity | average_time_between_purchase | averageorders | Totalorders-jan | Totalorders-feb | Totalorders-march | Totalorders-april | Totalorders-may | Totalorders-june | Totalorders-july | Totalorders-august | Totalorders-september | Totalorders-october | Totalorders-november | Totalorders-december | Totalmails-notoponed | Totalmails-oponed | Notapplicableclickedmails | timesclicked-yes | timesclicked-no | days_since_first_transaction | email_duration | duration_of_customer_relation | recency | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.00000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.0 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 |
| mean | 21.140260 | 3.318651 | 62.107976 | 4.397775 | 357.622922 | 58.586608 | 65.653778 | 12.527876 | 61.931444 | 32.711545 | 19.269270 | 21.799874 | 22.378757 | 26.32351 | 27.053904 | 27.685516 | 28.474349 | 35.221662 | 39.969941 | 40.028128 | 36.706465 | 15.525273 | 6.207389 | 1.0 | 19.563854 | 1.168808 | 86.753568 | 105.719563 | 129.325441 | 1426.424223 |
| std | 19.620475 | 12.121427 | 38.729691 | 12.142475 | 1335.649680 | 36.428571 | 41.270385 | 36.102332 | 38.318825 | 128.808814 | 103.623359 | 114.836843 | 113.052766 | 125.30813 | 127.011724 | 129.660528 | 130.733997 | 132.849129 | 138.678043 | 134.577762 | 138.087054 | 17.239732 | 8.517405 | 0.0 | 19.168208 | 2.376062 | 220.450104 | 241.409428 | 266.695842 | 675.530650 |
| min | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1.000000 | 0.000000 | 41.000000 | 1.000000 | 41.000000 | 40.000000 | 41.000000 | 0.000000 | 41.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1469.000000 |
| 50% | 19.000000 | 0.210526 | 51.000000 | 1.000000 | 65.000000 | 48.000000 | 53.000000 | 0.000000 | 51.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7.000000 | 2.000000 | 1.0 | 17.000000 | 0.000000 | 0.000000 | 6.000000 | 16.000000 | 1543.000000 |
| 75% | 40.000000 | 2.116279 | 75.000000 | 3.000000 | 169.000000 | 69.000000 | 80.000000 | 10.000000 | 74.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 10.250000 | 41.000000 | 40.000000 | 0.000000 | 32.000000 | 8.000000 | 1.0 | 38.000000 | 1.000000 | 41.000000 | 69.000000 | 102.000000 | 1663.000000 |
| max | 291.000000 | 561.000000 | 1501.000000 | 257.000000 | 51429.000000 | 1501.000000 | 1501.000000 | 807.500000 | 1501.000000 | 3788.000000 | 3198.000000 | 3670.000000 | 3679.000000 | 6114.00000 | 4939.000000 | 4035.000000 | 5179.000000 | 3358.000000 | 4845.000000 | 5461.000000 | 4408.000000 | 91.000000 | 256.000000 | 1.0 | 249.000000 | 42.000000 | 1984.000000 | 1963.000000 | 1998.000000 | 3463.000000 |
#train.dtypes
If we observe, we have extreme values present in several columns, so we will decide whether to replace or transform them on the basis of the boxplots we will get.
# separating numerical, categorical and datetime columns
numericalcolumns = []
categoricalcolumns = []
datetypecolumns = []
for i in train.columns:
    if train[i].dtypes == "object":
        categoricalcolumns.append(i)
    elif train[i].dtypes == "datetime64[ns]":
        datetypecolumns.append(i)
    else:
        numericalcolumns.append(i)
print(numericalcolumns)
print(categoricalcolumns)
print(datetypecolumns)
['numofemails', 'average_time_between_mails', 'OrderQuantity', 'num_transactions', 'totalorders', 'minimumorderquantity', 'maximumorderquantity', 'average_time_between_purchase', 'averageorders', 'Totalorders-jan', 'Totalorders-feb', 'Totalorders-march', 'Totalorders-april', 'Totalorders-may', 'Totalorders-june', 'Totalorders-july', 'Totalorders-august', 'Totalorders-september', 'Totalorders-october', 'Totalorders-november', 'Totalorders-december', 'Totalmails-notoponed', 'Totalmails-oponed', 'Notapplicableclickedmails', 'timesclicked-yes', 'timesclicked-no', 'days_since_first_transaction', 'email_duration', 'duration_of_customer_relation', 'recency']
['CustomerID', 'City', 'OnlineCommunication', 'AutomaticRefill', 'DoorstepDelivery', 'PreferredDeliveryDay', 'RetentionPotential', 'EmailType', 'MailOpened', 'MailClicked', 'days_since_first_transaction_category', 'email_duration_category']
['DateOfRegistration', 'emailDatetime', 'lastemailDatetime', 'firstemailDatetime', 'last_transaction_date', 'first_transaction_date']
3.Data Quality Check And Missing Values
3.1 Percentage of missing values for columns
# checking column-wise null percentages
cols = pd.DataFrame(train.isnull().mean().round(4) * 100, columns=['percentage_missing_value']).sort_values(by=['percentage_missing_value'])
print(cols)
                                       percentage_missing_value
CustomerID                             0.0
Totalorders-feb                        0.0
Totalorders-march                      0.0
Totalorders-april                      0.0
Totalorders-may                        0.0
Totalorders-june                       0.0
Totalorders-july                       0.0
Totalorders-august                     0.0
Totalorders-september                  0.0
Totalorders-october                    0.0
Totalorders-jan                        0.0
Totalorders-november                   0.0
Totalmails-notoponed                   0.0
Totalmails-oponed                      0.0
Notapplicableclickedmails              0.0
timesclicked-yes                       0.0
timesclicked-no                        0.0
days_since_first_transaction           0.0
email_duration                         0.0
duration_of_customer_relation          0.0
recency                                0.0
Totalorders-december                   0.0
averageorders                          0.0
average_time_between_purchase          0.0
first_transaction_date                 0.0
City                                   0.0
DateOfRegistration                     0.0
OnlineCommunication                    0.0
AutomaticRefill                        0.0
DoorstepDelivery                       0.0
PreferredDeliveryDay                   0.0
RetentionPotential                     0.0
EmailType                              0.0
MailOpened                             0.0
MailClicked                            0.0
emailDatetime                          0.0
numofemails                            0.0
lastemailDatetime                      0.0
firstemailDatetime                     0.0
average_time_between_mails             0.0
OrderQuantity                          0.0
last_transaction_date                  0.0
num_transactions                       0.0
totalorders                            0.0
minimumorderquantity                   0.0
maximumorderquantity                   0.0
days_since_first_transaction_category  0.0
email_duration_category                0.0
Though there are no missing values in columns such as average_time_between_mails and average_time_between_purchase, those columns cannot legitimately contain 0s, so we will have to replace the 0s with the mean or median (after checking the distributions with a dist plot) or bin these columns into a few labels.
# replacing the 0's with either the mean or the median after checking the distribution
sns.histplot(train["average_time_between_mails"], kde=True, stat="density")  # distplot is deprecated in recent seaborn
# the distribution is not normal; it is right-skewed, so we will impute with the median, which is not affected by outliers
<Axes: xlabel='average_time_between_mails', ylabel='Density'>
# replacing the 0's with either the mean or the median after checking the distribution
sns.histplot(train["average_time_between_purchase"], kde=True, stat="density")  # distplot is deprecated in recent seaborn
# this distribution is also right-skewed, so we will impute with the median here as well
<Axes: xlabel='average_time_between_purchase', ylabel='Density'>
#train.columns
import numpy as np  # pd.np was removed in pandas 1.0, so use numpy directly

# 1) for column average_time_between_mails
# replace zeros with NaN
train["average_time_between_mails"].replace(0, np.nan, inplace=True)
# replace NaN values with the median of the column
median = train["average_time_between_mails"].median()
train["average_time_between_mails"].fillna(median, inplace=True)
# 2) for column average_time_between_purchase
# replace zeros with NaN
train["average_time_between_purchase"].replace(0, np.nan, inplace=True)
# replace NaN values with the median of the column
median = train["average_time_between_purchase"].median()
train["average_time_between_purchase"].fillna(median, inplace=True)
# Check total duplicated values
train.duplicated().sum()
0
total_orders = {}
# the monthly Totalorders-* columns live in train
for month in train.columns:
    if month.startswith('Totalorders-'):
        total_orders[month] = train[month].sum()
total_orders
# Create a list of months and a list of total orders
months = list(total_orders.keys())
orders = list(total_orders.values())
# Create a bar plot
plt.figure(figsize=(5,5))
plt.bar(months, orders)
plt.xticks(rotation=90)
# Set the title and axis labels
plt.title('Total Orders by Month')
plt.xlabel('Month')
plt.ylabel('Total Orders')
# Show the plot
plt.show()
Orders exceeded 200,000 in January, September, October, November and December, and were below 200,000 in the other months.
# each "quarter" here is actually a four-month block: Jan-Apr, May-Aug, Sep-Dec
train["Totalorderquantity_in_first_quarter"] = train["Totalorders-jan"] + train["Totalorders-feb"] + train["Totalorders-march"] + train["Totalorders-april"]
train["Totalorderquantity_in_second_quarter"] = train["Totalorders-may"] + train["Totalorders-june"] + train["Totalorders-july"] + train["Totalorders-august"]
train["Totalorderquantity_in_Third_quarter"] = train["Totalorders-september"] + train["Totalorders-october"] + train["Totalorders-november"] + train["Totalorders-december"]
train=train.drop(['Totalorders-jan', 'Totalorders-feb',
'Totalorders-march', 'Totalorders-april', 'Totalorders-may',
'Totalorders-june', 'Totalorders-july', 'Totalorders-august',
'Totalorders-september', 'Totalorders-october', 'Totalorders-november',
'Totalorders-december'],axis=1)
checking categorical columns
train["City"].value_counts()
CITY4    9143
CITY2    6879
CITY3    6490
CITY1    1308
Name: City, dtype: int64
train["OnlineCommunication"].value_counts()
# map OnlineCommunication to new categories
new_categories = {
0: 'no',
1: 'yes',
}
train['OnlineCommunication'] = train['OnlineCommunication'].map(new_categories)
# print the resulting dataframe
print(train['OnlineCommunication'].value_counts())
yes    15731
no      8089
Name: OnlineCommunication, dtype: int64
#train["AutomaticRefill"].value_counts()
# map AutomaticRefill to new categories
new_categories = {
0: 'no',
1: 'yes',
}
train['AutomaticRefill'] = train['AutomaticRefill'].map(new_categories)
# print the resulting dataframe
print(train['AutomaticRefill'].value_counts())
no     21564
yes     2256
Name: AutomaticRefill, dtype: int64
#train["DoorstepDelivery"].value_counts()
# map DoorstepDelivery to new categories
new_categories = {
0: 'no',
1: 'yes',
}
train['DoorstepDelivery'] = train['DoorstepDelivery'].map(new_categories)
# print the resulting dataframe
print(train['DoorstepDelivery'].value_counts())
no     22924
yes      896
Name: DoorstepDelivery, dtype: int64
train["PreferredDeliveryDay"].value_counts()
# map PreferredDeliveryDay to new categories
new_categories = {
'Monday': 'Monday',
'Tuesday':'Tuesday',
'Wednesday':'Wednesday',
'Thursday':'Thursday',
'Friday':'Friday',
'Saturday':'weekend',
'Sunday':'weekend'
}
train['PreferredDeliveryDay'] = train['PreferredDeliveryDay'].map(new_categories)
# print the resulting dataframe
print(train['PreferredDeliveryDay'].value_counts())
Tuesday      5343
Monday       5300
Friday       4167
Thursday     4043
Wednesday    3778
weekend      1189
Name: PreferredDeliveryDay, dtype: int64
train["RetentionPotential"].value_counts()
Low       19185
High       3756
Medium      879
Name: RetentionPotential, dtype: int64
train["EmailType"].value_counts()
CustomizedEmail    14111
WelcomeEmail        9709
Name: EmailType, dtype: int64
train["MailOpened"].value_counts()
yes    13749
no     10071
Name: MailOpened, dtype: int64
train["MailClicked"].value_counts()
no               13191
NotApplicable     9709
yes                920
Name: MailClicked, dtype: int64
train["Notapplicableclickedmails"].value_counts()  # only one value is present in the whole column, hence we can remove it
train=train.drop(["Notapplicableclickedmails"],axis=1)
for i in categoricalcolumns:
    train[i] = train[i].astype("category")
train[categoricalcolumns].dtypes
CustomerID                               category
City                                     category
OnlineCommunication                      category
AutomaticRefill                          category
DoorstepDelivery                         category
PreferredDeliveryDay                     category
RetentionPotential                       category
EmailType                                category
MailOpened                               category
MailClicked                              category
days_since_first_transaction_category    category
email_duration_category                  category
dtype: object
# separating numerical and categorical columns again after the dtype conversion
numericalcolumns = []
categoricalcolumns = []
datetypecolumns = []
for i in train.columns:
    if train[i].dtypes == "category":
        categoricalcolumns.append(i)
    elif train[i].dtypes == "datetime64[ns]":
        datetypecolumns.append(i)
    else:
        numericalcolumns.append(i)
numericalcolumns
['numofemails', 'average_time_between_mails', 'OrderQuantity', 'num_transactions', 'totalorders', 'minimumorderquantity', 'maximumorderquantity', 'average_time_between_purchase', 'averageorders', 'Totalmails-notoponed', 'Totalmails-oponed', 'timesclicked-yes', 'timesclicked-no', 'days_since_first_transaction', 'email_duration', 'duration_of_customer_relation', 'recency', 'Totalorderquantity_in_first_quarter', 'Totalorderquantity_in_second_quarter', 'Totalorderquantity_in_Third_quarter']
outlier analysis for numerical columns
numcols=["OrderQuantity",'numofemails','num_transactions','days_since_first_transaction','recency']
for i in numcols:
    plt.figure(figsize=(6, 6))
    sns.boxplot(train[i])
    plt.title("Boxplot of " + i)
    # calculate quartiles and upper/lower bounds
    q1 = train[i].quantile(0.25)
    q2 = train[i].quantile(0.5)
    q3 = train[i].quantile(0.75)
    iqr = q3 - q1
    upper = q3 + 1.5 * iqr
    lower = q1 - 1.5 * iqr
    # mark quartiles and upper/lower bounds on the plot
    plt.axhline(q1, color='r', linestyle='--', label='Q1')
    plt.axhline(q2, color='g', linestyle='-', label='Q2 (Median)')
    plt.axhline(q3, color='b', linestyle='--', label='Q3')
    plt.axhline(upper, color='gray', linestyle='--', label='Upper Bound')
    plt.axhline(lower, color='gray', linestyle='--', label='Lower Bound')
    plt.legend()
    plt.show()
# percentage of outliers present in each column
for i in numcols:
    IQR = train[i].quantile(0.75) - train[i].quantile(0.25)
    Upper_limit = train[i].quantile(0.75) + IQR * 1.5
    lower_limit = train[i].quantile(0.25) - IQR * 1.5
    pct = round(100.0 * len(train[train[i] > Upper_limit]) / len(train), 2) + round(100.0 * len(train[train[i] < lower_limit]) / len(train), 2)
    print("|", i, "|", pct)
| OrderQuantity | 5.8
| numofemails | 0.02
| num_transactions | 12.09
| days_since_first_transaction | 17.11
| recency | 34.28
Here we are not going to replace the outliers: they are not missing at random and may carry significant information, so we keep them, as they might be important and useful for our model building.
grouped_data = train.groupby("City").size().reset_index(name="Number of Customers")
import matplotlib.pyplot as plt
# define colors for each bar
colors = ['#fc4f30', '#e5ae37', '#6d904f', '#a27ea8']
# plot bar chart
grouped_data.plot(kind='bar', x='City', y='Number of Customers',color=colors)
# set chart title and axis labels
plt.title('Number of Customers per City')
plt.xlabel('City')
plt.ylabel('Number of Customers')
# show plot
plt.show()
For CITY1 the number of customers is far lower than for CITY2, CITY3 and CITY4. From this we can draw a few tentative conclusions, which might or might not hold:
1)The more customers a city has, the more focus it deserves, as there is higher potential for customer retention; further analysis may reveal the status of retention potential in those cities.
2)The fewer the customers, the more marketing and promotional schemes should be developed there to attract new customers.
# group the data by city and retention potential, and count the number of customers for each combination
counts = train.groupby(['City','RetentionPotential']).size().reset_index(name="Number of Customers")
counts
| City | RetentionPotential | Number of Customers | |
|---|---|---|---|
| 0 | CITY1 | High | 125 |
| 1 | CITY1 | Low | 1155 |
| 2 | CITY1 | Medium | 28 |
| 3 | CITY2 | High | 1257 |
| 4 | CITY2 | Low | 5168 |
| 5 | CITY2 | Medium | 454 |
| 6 | CITY3 | High | 866 |
| 7 | CITY3 | Low | 5503 |
| 8 | CITY3 | Medium | 121 |
| 9 | CITY4 | High | 1508 |
| 10 | CITY4 | Low | 7359 |
| 11 | CITY4 | Medium | 276 |
1)If we observe, CITY2 has the highest number of customers with medium retention potential; we should not lose those customers, as they are valuable.
2)CITY4 has 7359 customers with low retention potential. Their risk of churning is much higher than in other cities, so a reassessment should take place so that churn does not happen.
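The per-city risk reading can be made size-independent by converting the counts above into within-city shares; a minimal sketch, with the numbers hand-copied from the City × RetentionPotential table (so the frame itself is illustrative):

```python
import pandas as pd

# Counts copied from the City x RetentionPotential table above
counts = pd.DataFrame({
    "City": ["CITY1"] * 3 + ["CITY2"] * 3 + ["CITY3"] * 3 + ["CITY4"] * 3,
    "RetentionPotential": ["High", "Low", "Medium"] * 4,
    "Number of Customers": [125, 1155, 28, 1257, 5168, 454,
                            866, 5503, 121, 1508, 7359, 276],
})

# Share of each retention class within its own city:
# divide each count by that city's total via groupby + transform
counts["share"] = counts["Number of Customers"] / counts.groupby("City")["Number of Customers"].transform("sum")
print(counts.round(3))
```

Shares make cities of different sizes directly comparable, which absolute counts (CITY4's 7359 vs CITY1's 1155) do not.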
# Group the data by city: count customers, total orders (sum and mean), number of transactions, and compare order quantity across the three periods
data2 = train.groupby('City').agg({'CustomerID':'count','totalorders':['sum','mean'],'num_transactions':'sum','Totalorderquantity_in_first_quarter':'sum','Totalorderquantity_in_second_quarter':'sum','Totalorderquantity_in_Third_quarter':'sum'})
data2.head()
| CustomerID | totalorders | num_transactions | Totalorderquantity_in_first_quarter | Totalorderquantity_in_second_quarter | Totalorderquantity_in_Third_quarter | ||
|---|---|---|---|---|---|---|---|
| count | sum | mean | sum | sum | sum | sum | |
| City | |||||||
| CITY1 | 1308 | 261920 | 200.244648 | 3632 | 52923 | 70081 | 138916 |
| CITY2 | 6879 | 4380420 | 636.781509 | 48860 | 1296802 | 1439515 | 1644103 |
| CITY3 | 6490 | 1098124 | 169.202465 | 15700 | 234813 | 222573 | 640738 |
| CITY4 | 9143 | 2778114 | 303.851471 | 36563 | 705980 | 877009 | 1195125 |
1)From the groupby above, CITY2 has only 6879 customers, yet they ordered heavily: total orders sum to 4,380,420 across 48,860 transactions, with order quantity growing each period. The earlier plot also shows nearly 1,700 CITY2 customers with high or medium retention potential, so CITY2 customers add great value to our analysis.
2)Similarly for CITY4: the previous plot showed its churn risk is higher than elsewhere, but its customers with medium and high retention potential are also numerous, its orders and transactions are second highest among the cities, and they have been improving period over period.
3)The average order value in CITY1 is higher than in CITY3 despite CITY1 having far fewer customers, and CITY3 has lower order quantity during the second period of the year.
#checking number of customers registered over years
plt.figure(figsize=(5,5))
train['DateOfRegistration'] = pd.to_datetime(train['DateOfRegistration'])
train['RegistrationYear'] = train['DateOfRegistration'].dt.year
yearly_registrations = train.groupby('RegistrationYear').count()['CustomerID']
plt.plot(yearly_registrations.index, yearly_registrations.values)
plt.xlabel('Year')
plt.ylabel('Number of Registrations')
plt.title('no of customers over time')
plt.show()
yearly_registrations
RegistrationYear
2008       45
2009      349
2010     1171
2011     2765
2012     1980
2013    12009
2014     1097
2015      705
2016      533
2017     2956
2018      210
Name: CustomerID, dtype: int64
Customer registrations grew rapidly up to 2013, then dropped suddenly; they rose again in 2017 before falling once more. We have to find out why registrations collapsed after 2013. Our earlier transaction and email visualizations also looked healthy during 2013-2014 and declined afterwards; the fall in registrations might be one of the potential reasons.
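The post-2013 collapse can be quantified with a year-over-year percentage change on the registration counts printed above; a minimal sketch using a hand-copied subset of those yearly figures:

```python
import pandas as pd

# Subset of the yearly registration counts from the output above (2011-2014)
yearly = pd.Series({2011: 2765, 2012: 1980, 2013: 12009, 2014: 1097})

# pct_change() compares each year with the previous one; *100 -> percent
yoy = (yearly.pct_change() * 100).round(1)
print(yoy)
```

The 2013 to 2014 change comes out at roughly -91%, which pins down the anomaly worth investigating.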
plt.figure(figsize=(4,4))
sns.heatmap(train[['averageorders','average_time_between_mails',
'average_time_between_purchase',
'duration_of_customer_relation',
'recency']].corr(),annot=True)
<Axes: >
**1)If we observe, average_time_between_mails has a positive correlation with duration_of_customer_relation**
**2)recency has no correlation at all with any of the other columns**
**3)average_time_between_purchase also has a positive correlation with duration_of_customer_relation**
import plotly.express as px
# let's compare email duration category with mail opened (yes/no)
fig = px.histogram(train, x="MailOpened", color="email_duration_category", barmode="group", title="<b>email duration w.r.t mail opened</b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
Points to be noted:
1)Of the customers with a very long email history, only 1382 opened their last mail while 3578 did not.
2)Of the customers who received only one mail, 880 did not even open it. Since the single-mail recipients are the welcome-mail customers, we can infer those 880 customers were not interested.
3)544 customers with medium history opened the mail while 1336 did not.
4)1749 customers with a long mail history opened the mail while 4277 did not.
Key point: 880 welcome-mail recipients never opened their mail, suggesting they were not interested.
#lets compare OnlineCommunication category with days_since_first_transaction_category
fig = px.histogram(train, x="OnlineCommunication", color="days_since_first_transaction_category",barmode="group", title="<b>OnlineCommunication category with days_since_first_transaction_category </b>")
fig.update_layout(width=700, height=300, bargap=0.1)
fig.show()
1)Only transaction: customers prefer online communication more.
2)Long-term history: fewer customers prefer online communication.
3)Moderate history: customers prefer online communication more.
4)Recent transactions: customers preferred online communication more.
Key point: customers with a long-term history are less likely to prefer online communication.
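The read-off from the grouped histogram can be double-checked with a row-normalised crosstab; a small sketch with made-up rows (in the notebook the real call would pass the two train columns directly):

```python
import pandas as pd

# Hypothetical mini-frame standing in for train
toy = pd.DataFrame({
    "days_since_first_transaction_category": ["Only transaction", "Long-term History",
                                              "Long-term History", "Only transaction"],
    "OnlineCommunication": ["yes", "no", "no", "yes"],
})

# normalize="index" turns raw counts into within-row shares, so each row answers:
# what fraction of this history group prefers online communication?
rates = pd.crosstab(toy["days_since_first_transaction_category"],
                    toy["OnlineCommunication"], normalize="index")
print(rates)
```

Shares are easier to compare across groups of very different sizes than the raw bar heights.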
# let's compare days_since_first_transaction_category with RetentionPotential
fig = px.histogram(train, x="days_since_first_transaction_category", color="RetentionPotential", barmode="group", title="<b>days_since_first_transaction_category w.r.t RetentionPotential</b>")
fig.update_layout(width=700, height=300, bargap=0.1)
fig.show()
Customers with moderate and long-term history include a few customers with medium retention potential. This is useful for our analysis, because our focus is on medium-retention customers: concentrating on them can improve the business, and since they have already been customers for so long, we should plan something for them so that they remain loyal.
train[["minimumorderquantity","maximumorderquantity","averageorders"]].describe()
| minimumorderquantity | maximumorderquantity | averageorders | |
|---|---|---|---|
| count | 23820.000000 | 23820.000000 | 23820.000000 |
| mean | 58.586608 | 65.653778 | 61.931444 |
| std | 36.428571 | 41.270385 | 38.318825 |
| min | 0.000000 | 0.000000 | 0.000000 |
| 25% | 40.000000 | 41.000000 | 41.000000 |
| 50% | 48.000000 | 53.000000 | 51.000000 |
| 75% | 69.000000 | 80.000000 | 74.000000 |
| max | 1501.000000 | 1501.000000 | 1501.000000 |
sns.scatterplot(x=train['minimumorderquantity'], y=train['maximumorderquantity'])
plt.xlabel('Minimum Order Quantity')
plt.ylabel('Maximum Order Quantity')
plt.title('Relationship between Minimum and Maximum Order Quantity')
plt.show()
#there is a linear relationship between minimum and maximum order quantity; since they are redundant, we can consider removing them.
There is a linear relationship between these columns, so they will not help during model building and we will drop them from the analysis. Since the data also contains averageorders, we will check that column as well.
sns.scatterplot(x=train['minimumorderquantity'], y=train['averageorders'])
plt.xlabel('Minimum Order Quantity')
plt.ylabel('averageorders Quantity')
plt.title('Relationship between Minimum and averageorders Quantity')
plt.show()
#there is a linear relationship between minimum and average order quantity; since the column is redundant, we can consider removing it.
There is a linear relationship between these columns as well, so minimumorderquantity will not help during model building and we will not use it in the analysis.
# Identify most profitable customers
profitable_customers = train[train['averageorders'] > 100].reset_index()
# Optimize inventory management
large_order_customers = train[train['averageorders'] > 50]
small_order_customers = train[train['averageorders'] <= 50]
# Improve customer experience based on order quantity
train['customer_segment'] = train.apply(lambda row: 'large_order_customer' if row['OrderQuantity'] > 50 else 'small_order_customer', axis=1)
train['customer_segment']=train['customer_segment'].astype("category")
Segmenting customers into small-order customers and large-order customers (stored in customer_segment) based on their order quantities.
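The row-wise apply/lambda above can equivalently be written as a vectorized np.where, which avoids a Python call per row; a sketch on a toy frame with the same column names:

```python
import numpy as np
import pandas as pd

# Toy frame; only OrderQuantity matters here.
train = pd.DataFrame({"OrderQuantity": [10, 50, 51, 200]})

# Vectorized equivalent of the row-wise apply: > 50 -> large, else small.
train["customer_segment"] = np.where(
    train["OrderQuantity"] > 50, "large_order_customer", "small_order_customer"
)
train["customer_segment"] = train["customer_segment"].astype("category")
print(train["customer_segment"].tolist())  # small, small, large, large
```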
#Checking imbalance in data
# Finding % of customers with retention potential of high,low,medium.
target_high_percentage = round(len(train.query('RetentionPotential=="High"'))/len(train)*100, 2)
print("target_high_percentage:", target_high_percentage, "%")
target_Medium_percentage = round(len(train.query('RetentionPotential=="Medium"'))/len(train)*100, 2)
print("target_Medium_percentage:", target_Medium_percentage, "%")
target_Low_percentage = round(len(train.query('RetentionPotential=="Low"'))/len(train)*100, 2)
print("target_Low_percentage:", target_Low_percentage, "%")
target_high_percentage: 15.8 % target_Medium_percentage: 3.7 % target_Low_percentage: 80.54 %
As we can see, there is high imbalance in the data: the Low class makes up a very large proportion, while the Medium and High classes are very small.
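The three per-class percentages can also come from a single value_counts(normalize=True) call; a sketch on a made-up series with roughly the same class balance:

```python
import pandas as pd

# Made-up series standing in for train["RetentionPotential"].
s = pd.Series(["Low"] * 80 + ["High"] * 16 + ["Medium"] * 4, name="RetentionPotential")

# One call replaces the three query()/len() computations.
pct = (s.value_counts(normalize=True) * 100).round(2)
print(pct.to_dict())  # {'Low': 80.0, 'High': 16.0, 'Medium': 4.0}
```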
Creation of three data sets - one each for high, medium, and low retention potential.
# Creating a dataframe of the customers with high retention potential
high= train.query('RetentionPotential=="High"')
# Creating a dataframe of the customers with low retention potential
low= train.query('RetentionPotential=="Low"')
# Creating a dataframe of the customers with medium retention potential
medium= train.query('RetentionPotential=="Medium"')
catcolanalysis=['OnlineCommunication','AutomaticRefill','DoorstepDelivery','MailOpened','MailClicked','days_since_first_transaction_category','email_duration_category','customer_segment']
#loop for performing univariate analysis
for i in catcolanalysis:
plt.figure(figsize=(10,5))
plt.subplot(1,3,1)
high[i].value_counts(normalize=True).plot.bar()
plt.title(i+ '-high')
plt.subplot(1,3,2)
low[i].value_counts(normalize=True).plot.bar()
plt.title(i+ '- low')
plt.subplot(1,3,3)
medium[i].value_counts(normalize=True).plot.bar()
plt.title(i+ ' - medium')
Understandings from the above plots: recency measures the time elapsed since the customer's last transaction, so its values quantify how long each customer has been inactive.
#lets check the transaction-history category w.r.t how recent a customer has been, segregated by RetentionPotential as hue
sns.scatterplot(x=train["days_since_first_transaction_category"],y=train["recency"],hue=train["RetentionPotential"])
<Axes: xlabel='days_since_first_transaction_category', ylabel='recency'>
1)Customers with long-term or moderate history whose most recent transaction was between 0 and 500 days ago include a few customers whose retention potential is medium.
2)Customers with moderate history and recency between 2500 and 3000 days also include a few customers with a medium possibility of retention.
ax = sns.kdeplot(train.recency[train["RetentionPotential"] == 'Medium'],
                 color="Gold", fill=True);
ax = sns.kdeplot(train.recency[train["RetentionPotential"] == 'High'],
                 ax=ax, color="Green", fill=True);
ax.legend(["Medium", "High"], loc='upper right');  # label order must match plotting order
ax.set_ylabel('Density');
ax.set_xlabel('customer recency');
ax.set_title('Distribution of customer recency by retention potential');
The plot shows the relative frequency of observations in each range of customer recency.
1)0-1000 days: there are more High-retention customers present.
2)1000-4000 days: these customers are not very active but can still be retained; Medium customers dominate this range, so there is a chance of them churning.
for i in range(2008,2019,2):
t=train[train["RegistrationYear"]==i]
fig = px.pie(t, values='OrderQuantity', names='RetentionPotential', title=f'order quantity every year w.r.t to registration year {i}')
fig.show()
Here we are comparing order quantities by registration year in two-year steps. Year 2010 had the highest order quantity for medium and high customers. An interesting point is that the customers who are most likely to churn had the highest order-quantity rate in 2018.
train.columns
Index(['CustomerID', 'City', 'OnlineCommunication', 'AutomaticRefill',
'DoorstepDelivery', 'PreferredDeliveryDay', 'RetentionPotential',
'EmailType', 'MailOpened', 'MailClicked', 'numofemails',
'average_time_between_mails', 'OrderQuantity', 'num_transactions',
'sumoforderquantity', 'average_time_between_purchase',
'days_since_first_transaction', 'email_duration',
'duration_of_customer_relation', 'recency',
'days_since_first_transaction_category', 'email_duration_category',
'RegistrationYear', 'customer_segment', 'Time_to_First_Email',
'Time_since_Last_Email', 'most_ordered_quarter', 'Registration_month',
'Registration_weekday', 'lastemail_month', 'lastemail_weekday',
'transaction_month', 'transaction_weekday', 'open_rate',
'click_through_rate', 'response_rate', 'open_rate_ratio',
'click_through_rate_ratio', 'response_rate_ratio'],
dtype='object')
# create a cross-tabulation table to count the number of cases
ct = pd.crosstab(index=medium['DoorstepDelivery'],
columns=[medium['AutomaticRefill'],medium['RetentionPotential']])
# create the heatmap with annotations showing the counts
plt.figure(figsize=(5, 5))
sns.heatmap(ct, annot=True, fmt='d')
plt.show()
There are 740 medium-retention customers who preferred neither doorstep delivery nor automatic refill. There may be some underlying factors here, so in the near future the company should look into improving these services.
# create a cross-tabulation table to count the number of cases
ct = pd.crosstab(index=medium['days_since_first_transaction_category'],
columns=[medium['email_duration_category'],medium['RetentionPotential']])
# create the heatmap with annotations showing the counts
plt.figure(figsize=(10, 5))
sns.heatmap(ct, annot=True, fmt='d')
plt.show()
There are 355 medium-retention-potential customers with a long-term transaction history who were sent only mail.
Customers with medium retention potential and a long-term transaction history are comparatively more numerous than the others, so we can target these customers and prevent them from churning.
Columns such as Time_to_First_Email and Time_since_Last_Email might be useful: they tell us whether a mail was sent to the customer before or after the transaction, and Time_since_Last_Email tells us how recently the last mail was sent.
train["Time_to_First_Email"]=(train["first_transaction_date"]-train["firstemailDatetime"]).dt.days
train["Time_to_First_Email"]=train["Time_to_First_Email"].astype("int")
train["Time_since_Last_Email"]=(train["last_transaction_date"]-train["lastemailDatetime"]).dt.days
train["Time_since_Last_Email"]=train["Time_since_Last_Email"].astype("int")
KeyError: 'first_transaction_date' — this cell must run while first_transaction_date and the other date columns still exist; re-running it after they are dropped raises this error.
train["Time_to_First_Email"] = np.where(train["Time_to_First_Email"] < 0, 0, train["Time_to_First_Email"])
RFM Segmentation: This approach segments customers based on their recency, frequency, and monetary value of their transactions. You already have some features related to transaction history, such as last_transaction_date, num_transactions, and OrderQuantity, which could be used to calculate these metrics. RFM segmentation can be helpful for identifying high-value customers, loyal customers, and customers who may be at risk of churn.
Behavioral Segmentation: This approach segments customers based on their behaviors and preferences, such as their preferred delivery day, online communication preferences, and response to email campaigns (MailOpened, MailClicked). You could also use features related to customer retention potential, such as RetentionPotential, to identify customer segments that are likely to require special attention or retention strategies.
Hybrid Segmentation: You could also combine multiple segmentation approaches to create a more comprehensive view of customer behavior and preferences. For example, you could combine RFM and Behavioral segmentation to identify high-value customers who are also responsive to email campaigns, or combine Behavioral and Demographic segmentation to identify customer segments with similar behavior and demographic characteristics.
#rfmsegmentation
#lets segment our customers based upon their transaction records
#from the describe below, the last transaction in the data was on 2018-01-21
transaction["DateOfOrder"].describe()
count 104755 unique 3151 top 2014-01-13 00:00:00 freq 310 first 2008-07-29 00:00:00 last 2018-01-21 00:00:00 Name: DateOfOrder, dtype: object
#compute, for each order, the number of days between it and the reference date 2018-01-21
transaction['days_since'] = pd.to_datetime(transaction['DateOfOrder']).apply(lambda x: (pd.to_datetime("2018-01-21") - x).days)
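The same days_since column can be computed with vectorized datetime arithmetic instead of a per-row apply; a sketch on a toy frame (the reference date 2018-01-21 comes from the describe() above):

```python
import pandas as pd

# Toy stand-in for the transaction frame.
transaction = pd.DataFrame({"DateOfOrder": ["2018-01-21", "2018-01-11", "2017-01-21"]})
ref = pd.Timestamp("2018-01-21")  # last order date in the data

# Scalar-minus-Series datetime arithmetic is vectorized; no lambda needed.
transaction["days_since"] = (ref - pd.to_datetime(transaction["DateOfOrder"])).dt.days
print(transaction["days_since"].tolist())  # [0, 10, 365]
```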
## Compute recency, frequency, and monetary_value
def rec_freq_value(x):
    recency = x['days_since'].min()  # how recently the customer was active
    first_purchase = x['days_since'].max()  # the largest gap corresponds to the first purchase
    frequency = x['days_since'].count()  # how many times the customer transacted
    order_value = x['OrderQuantity'].mean()  # average quantity per order
    c = ['recency', 'first_purchase', 'frequency', 'order_value']
    return pd.Series([recency, first_purchase, frequency, order_value], index=c)
#applying the function on my whole records
customer_purchases = transaction.groupby('CustomerID').apply(rec_freq_value)
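As an aside, the hand-rolled rec_freq_value function is equivalent to pandas named aggregation, which avoids the Python-level apply; a sketch on a toy frame with the same metric definitions:

```python
import pandas as pd

transaction = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "days_since": [10, 400, 30],
    "OrderQuantity": [40, 60, 55],
})

# Named aggregation: one (column, function) pair per output metric.
customer_purchases = transaction.groupby("CustomerID").agg(
    recency=("days_since", "min"),          # most recent activity
    first_purchase=("days_since", "max"),   # earliest purchase
    frequency=("days_since", "count"),      # number of transactions
    order_value=("OrderQuantity", "mean"),  # average order quantity
)
print(customer_purchases)
```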
K-means clustering: partition n observations into k clusters, assigning each observation to the cluster with the nearest mean (cluster center or centroid).
#customer_purchases
from sklearn.preprocessing import StandardScaler
#Bring the data on same scale
scaleobj = StandardScaler()
Scaled_Data = scaleobj.fit_transform(customer_purchases)
#Transform it back to dataframe
Scaled_Data = pd.DataFrame(Scaled_Data, index = customer_purchases.index, columns = customer_purchases.columns)
Why did we scale the data here? Scaling is an essential step in clustering, as all columns must be brought onto the same scale before computing distances.
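StandardScaler computes z = (x − mean) / std per column; without it, a column like recency (thousands of days) would dominate frequency (single digits) in the Euclidean distances k-means uses. A minimal check of that transform, in plain NumPy:

```python
import numpy as np

X = np.array([[1500.0, 1.0], [1600.0, 3.0], [1700.0, 5.0]])  # recency, frequency

# StandardScaler's transform is z = (x - mean) / std, column-wise:
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.mean(axis=0), Z.std(axis=0))  # each column now has mean ~0 and std 1
```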
#Scaled_Data
from sklearn.cluster import KMeans
sum_of_sq_dist = {}
for k in range(1, 15):
    km = KMeans(n_clusters=k, init='k-means++', max_iter=1000)
    km = km.fit(Scaled_Data)
    sum_of_sq_dist[k] = km.inertia_
#Plot the graph for the sum of square distance values and Number of Clusters
sns.pointplot(x = list(sum_of_sq_dist.keys()), y = list(sum_of_sq_dist.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()
At k = 5 the rate of decrease levels off (the elbow), so we pick k = 5 as the optimal number of clusters.
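The elbow read-off can be cross-checked with the silhouette score (higher is better). A sketch on synthetic blobs, since Scaled_Data itself is not reproduced here:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for Scaled_Data: 5 well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=5, random_state=42)

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
# The k maximizing the silhouette should agree with the elbow.
```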
#Perform K-Mean Clustering or build the K-Means clustering model
KMean_clust = KMeans(n_clusters= 5, init= 'k-means++', max_iter= 1000)
KMean_clust.fit(Scaled_Data)
#Find the clusters for the observation given in the dataset
customer_purchases['Cluster'] = KMean_clust.labels_
customer_purchases.head()
| | recency | first_purchase | frequency | order_value | Cluster |
|---|---|---|---|---|---|
| CustomerID | |||||
| C12116 | 1564.0 | 1571.0 | 3.0 | 38.333333 | 1 |
| C12117 | 1508.0 | 1508.0 | 1.0 | 41.000000 | 1 |
| C12118 | 1594.0 | 1594.0 | 1.0 | 41.000000 | 1 |
| C12119 | 1552.0 | 1552.0 | 1.0 | 52.000000 | 1 |
| C12120 | 1472.0 | 1472.0 | 1.0 | 31.000000 | 1 |
#customer_purchases['Cluster'].value_counts()
from matplotlib import pyplot as plt
##Scatter Plot Frequency Vs Recency
Colors = ["red", "green", "blue", "black", "yellow"]
customer_purchases['Color'] = customer_purchases['Cluster'].map(lambda p: Colors[p])
# DataFrame.plot creates its own figure, so no separate plt.figure call is needed
# (the extra call only produced an empty "Figure with 0 Axes").
ax = customer_purchases.plot(
    kind="scatter",
    x="recency", y="frequency",
    figsize=(10, 8),
    c=customer_purchases['Color']
)
customer_purchases['Color'].value_counts()
green 13103 red 4170 black 3979 blue 2161 yellow 407 Name: Color, dtype: int64
customer_purchases['Cluster'].value_counts()
1 13103 0 4170 3 3979 2 2161 4 407 Name: Cluster, dtype: int64
# map cluster ids to names in one step (avoids chained-assignment warnings)
cluster_names = {
    0: 'lessfrequent-midrecency',
    1: 'higherfrequency',
    2: 'lessfrequent,lessrecency',
    3: 'lessfrequent-higherrecency',
    4: 'lesserfrequent',
}
customer_purchases['Cluster Name'] = customer_purchases['Cluster'].map(cluster_names)
#customer_purchases
These are the categories we obtained from RFM segmentation, but we can't rely on them alone, so let's check RFM segmentation along with customer email behaviour.
RFM=transaction.groupby('CustomerID').apply(rec_freq_value)
RFM
| | recency | first_purchase | frequency | order_value |
|---|---|---|---|---|
| CustomerID | ||||
| C12116 | 1564.0 | 1571.0 | 3.0 | 38.333333 |
| C12117 | 1508.0 | 1508.0 | 1.0 | 41.000000 |
| C12118 | 1594.0 | 1594.0 | 1.0 | 41.000000 |
| C12119 | 1552.0 | 1552.0 | 1.0 | 52.000000 |
| C12120 | 1472.0 | 1472.0 | 1.0 | 31.000000 |
| ... | ... | ... | ... | ... |
| C42185 | 2210.0 | 2484.0 | 13.0 | 103.076923 |
| C42186 | 1520.0 | 1520.0 | 1.0 | 46.000000 |
| C42187 | 1589.0 | 1589.0 | 1.0 | 41.000000 |
| C42188 | 2301.0 | 2301.0 | 1.0 | 51.000000 |
| C42189 | 1528.0 | 1718.0 | 2.0 | 90.000000 |
23820 rows × 4 columns
#lets combine email behaviour with rfm
emailresponses=train[['CustomerID','Totalmails-notoponed', 'Totalmails-oponed', 'timesclicked-yes', 'timesclicked-no','numofemails']]
#rfm with email responsive
import pandas as pd
from sklearn.cluster import KMeans
import numpy as np
# Combine RFM scores and email response data
rfm_email = pd.merge(RFM,emailresponses, on=["CustomerID"])
rfm_email=rfm_email.drop(["CustomerID"],axis=1)
# Normalize the RFM scores and email response data
rfm_email_norm = (rfm_email - rfm_email.mean()) / rfm_email.std()
from sklearn.cluster import KMeans
sum_of_sq_dist = {}
for k in range(1, 15):
    km = KMeans(n_clusters=k, init='k-means++', max_iter=1000)
    km = km.fit(rfm_email_norm)
    sum_of_sq_dist[k] = km.inertia_
#Plot the graph for the sum of square distance values and Number of Clusters
sns.pointplot(x = list(sum_of_sq_dist.keys()), y = list(sum_of_sq_dist.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()
#Perform K-Mean Clustering or build the K-Means clustering model
KMean_clust = KMeans(n_clusters= 5, init= 'k-means++', max_iter= 1000)
KMean_clust.fit(rfm_email_norm)
#Find the clusters for the observation given in the dataset
rfm_email['Cluster'] = KMean_clust.labels_
rfm_email.head()
| | recency | first_purchase | frequency | order_value | Totalmails-notoponed | Totalmails-oponed | timesclicked-yes | timesclicked-no | numofemails | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1564.0 | 1571.0 | 3.0 | 38.333333 | 27 | 5 | 29 | 2 | 31 | 1 |
| 1 | 1508.0 | 1508.0 | 1.0 | 41.000000 | 0 | 1 | 0 | 0 | 1 | 4 |
| 2 | 1594.0 | 1594.0 | 1.0 | 41.000000 | 0 | 1 | 0 | 0 | 1 | 4 |
| 3 | 1552.0 | 1552.0 | 1.0 | 52.000000 | 28 | 14 | 37 | 4 | 41 | 3 |
| 4 | 1472.0 | 1472.0 | 1.0 | 31.000000 | 0 | 1 | 0 | 0 | 1 | 4 |
from matplotlib import pyplot as plt
##Scatter Plot Frequency Vs Recency
Colors = ["red", "green", "blue", "black", "yellow"]
rfm_email['Color'] = rfm_email['Cluster'].map(lambda p: Colors[p])
# DataFrame.plot creates its own figure; the separate plt.figure call is unnecessary.
ax = rfm_email.plot(
    kind="scatter",
    x="recency", y="frequency",
    figsize=(10, 8),
    c=rfm_email['Color']
)
ax = rfm_email.plot(
kind="scatter",
x="order_value", y="frequency",
figsize=(10,8),
c = rfm_email['Color']
)
# map cluster ids to names in one step (avoids chained-assignment warnings)
rfm_cluster_names = {
    0: 'midfrequency',
    1: 'Higherrecency-lessfrequent',
    2: 'recenttransactionswithlessfrequency',
    3: 'midfrequencywithrecenttransactionatintervals',
    4: 'higherfrequencyatdifferentintervals',
}
rfm_email['Cluster Name'] = rfm_email['Cluster'].map(rfm_cluster_names)
Assigning the segments obtained from the clustering analysis to the main dataframe.
train = train.reset_index().drop('index', axis=1)
train['Cluster Name'] = rfm_email['Cluster Name']
train['Cluster Name']=train['Cluster Name'].astype("category")
Let us now try building different models for our customer retention potential.
## Custom Function for Bar Plots
def barplot(column, df):
    bar_plot1 = sns.countplot(x=column, data=df)
    total = len(df[column])
    for p in bar_plot1.patches:
        percentage = '{:.2f}%'.format(100 * p.get_height() / total)
        height = p.get_height()
        bar_plot1.text(p.get_x() + p.get_width() / 2, height + 400, percentage, ha="center")
train[['totalorders','Totalorderquantity_in_first_quarter',
'Totalorderquantity_in_second_quarter',
'Totalorderquantity_in_Third_quarter']].describe()
| | totalorders | Totalorderquantity_in_first_quarter | Totalorderquantity_in_second_quarter | Totalorderquantity_in_Third_quarter |
|---|---|---|---|---|
| count | 23820.000000 | 23820.000000 | 23820.000000 | 23820.000000 |
| mean | 357.622922 | 96.159446 | 109.537280 | 151.926196 |
| std | 1335.649680 | 417.061493 | 472.822847 | 495.469196 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 41.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 65.000000 | 0.000000 | 0.000000 | 41.000000 |
| 75% | 169.000000 | 41.000000 | 52.000000 | 93.000000 |
| max | 51429.000000 | 13982.000000 | 20228.000000 | 17219.000000 |
sns.heatmap(train[['totalorders','Totalorderquantity_in_first_quarter',
'Totalorderquantity_in_second_quarter',
'Totalorderquantity_in_Third_quarter']].corr(),annot=True)
<Axes: >
# Loop over the rows of the dataframe and create a new column 'most_ordered_quarter'
for index, row in train.iterrows():
    if row['Totalorderquantity_in_first_quarter'] > row['Totalorderquantity_in_second_quarter'] and row['Totalorderquantity_in_first_quarter'] > row['Totalorderquantity_in_Third_quarter']:
        train.at[index, 'most_ordered_quarter'] = 'Q1'
    elif row['Totalorderquantity_in_second_quarter'] > row['Totalorderquantity_in_first_quarter'] and row['Totalorderquantity_in_second_quarter'] > row['Totalorderquantity_in_Third_quarter']:
        train.at[index, 'most_ordered_quarter'] = 'Q2'
    else:
        train.at[index, 'most_ordered_quarter'] = 'Q3'
train['most_ordered_quarter']=train['most_ordered_quarter'].astype("category")
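The iterrows loop can be replaced by idxmax over the three quarter columns (one caveat: the loop's else branch assigns Q3 on ties, whereas idxmax picks the first maximal column); a sketch on a toy frame:

```python
import pandas as pd

# Toy frame with the notebook's quarter columns.
train = pd.DataFrame({
    "Totalorderquantity_in_first_quarter":  [100, 10, 5],
    "Totalorderquantity_in_second_quarter": [20, 90, 5],
    "Totalorderquantity_in_Third_quarter":  [30, 40, 50],
})
qmap = {
    "Totalorderquantity_in_first_quarter": "Q1",
    "Totalorderquantity_in_second_quarter": "Q2",
    "Totalorderquantity_in_Third_quarter": "Q3",
}

# idxmax(axis=1) returns, per row, the column holding the maximum.
train["most_ordered_quarter"] = train[list(qmap)].idxmax(axis=1).map(qmap)
print(train["most_ordered_quarter"].tolist())  # ['Q1', 'Q2', 'Q3']
```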
barplot('most_ordered_quarter',train)
cat_cols = train.select_dtypes('category').columns
cat_cols
Index(['CustomerID', 'City', 'OnlineCommunication', 'AutomaticRefill',
'DoorstepDelivery', 'PreferredDeliveryDay', 'RetentionPotential',
'EmailType', 'MailOpened', 'MailClicked',
'days_since_first_transaction_category', 'email_duration_category',
'customer_segment', 'Cluster Name', 'most_ordered_quarter'],
dtype='object')
def find_chi2_independence(cat_col, target, df, alpha=0.05):
    data = df[[target, cat_col]]
    tab = sm.stats.Table.from_data(data)
    tab = tab.table_orig.to_numpy()
    print(f"---------------{target} Vs {cat_col} Chi Square Test of Independence -------------------")
    print("\n Contingency table :\n")
    print(tab)
    stat, p, dof, expected = chi2_contingency(tab)
    print("\n Expected table :\n")
    print(expected)
    print(f"The p value returned = {p} and degrees of freedom returned = {dof}")
    # interpret p-value
    print('significance(alpha) = %.3f' % (alpha))
    if p <= alpha:
        print('Dependent (reject H0)')
    else:
        print('Independent (fail to reject H0)')
for col in cat_cols:
    find_chi2_independence(col, 'RetentionPotential', train)
---------------RetentionPotential Vs CustomerID Chi Square Test of Independence -------------------
Contingency table :
[[0 0 1 ... 1 0 0]
[1 1 0 ... 0 1 1]
[0 0 0 ... 0 0 0]]
Expected table :
[[0.15768262 0.15768262 0.15768262 ... 0.15768262 0.15768262 0.15768262]
[0.80541562 0.80541562 0.80541562 ... 0.80541562 0.80541562 0.80541562]
[0.03690176 0.03690176 0.03690176 ... 0.03690176 0.03690176 0.03690176]]
The p value returned = 0.4965535116908812 and degrees of freedom returned = 47638
significance(alpha) = 0.050
Independent (fail to reject H0)
---------------RetentionPotential Vs City Chi Square Test of Independence -------------------
Contingency table :
[[ 125 1257 866 1508]
[1155 5168 5503 7359]
[ 28 454 121 276]]
Expected table :
[[ 206.2488665 1084.69874055 1023.36020151 1441.69219144]
[1053.4836272 5540.45403023 5227.14735516 7363.91498741]
[ 48.2675063 253.84722922 239.49244332 337.39282116]]
The p value returned = 2.7318851965917284e-77 and degrees of freedom returned = 6
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs OnlineCommunication Chi Square Test of Independence -------------------
Contingency table :
[[ 1893 1863]
[ 5740 13445]
[ 456 423]]
Expected table :
[[ 1275.49471033 2480.50528967]
[ 6515.00692695 12669.99307305]
[ 298.49836272 580.50163728]]
The p value returned = 1.1575020231150796e-156 and degrees of freedom returned = 2
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs AutomaticRefill Chi Square Test of Independence -------------------
Contingency table :
[[ 3756 0]
[17048 2137]
[ 760 119]]
Expected table :
[[ 3400.26801008 355.73198992]
[17367.98236776 1817.01763224]
[ 795.74962217 83.25037783]]
The p value returned = 2.9763053132377616e-103 and degrees of freedom returned = 2
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs DoorstepDelivery Chi Square Test of Independence -------------------
Contingency table :
[[ 3737 19]
[18340 845]
[ 847 32]]
Expected table :
[[ 3614.7163728 141.2836272 ]
[18463.34760705 721.65239295]
[ 845.93602015 33.06397985]]
The p value returned = 2.2611134234358053e-29 and degrees of freedom returned = 2
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs PreferredDeliveryDay Chi Square Test of Independence -------------------
Contingency table :
[[ 654 757 675 854 636 180]
[3387 4328 3232 4273 2990 975]
[ 126 215 136 216 152 34]]
Expected table :
[[ 657.06347607 835.71788413 637.51083123 842.49823678 595.72493703
187.48463476]
[3356.16687657 4268.70277078 3256.29534005 4303.33564232 3042.86020151
957.63916877]
[ 153.76964736 195.57934509 149.19382872 197.16612091 139.41486146
43.87619647]]
The p value returned = 0.001334597616908662 and degrees of freedom returned = 10
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs RetentionPotential Chi Square Test of Independence -------------------
Contingency table :
[[ 3756 0 0]
[ 0 19185 0]
[ 0 0 879]]
Expected table :
[[ 592.2559194 3025.14105793 138.60302267]
[ 3025.14105793 15451.89861461 707.96032746]
[ 138.60302267 707.96032746 32.43664987]]
The p value returned = 0.0 and degrees of freedom returned = 4
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs EmailType Chi Square Test of Independence -------------------
Contingency table :
[[ 859 2897]
[12908 6277]
[ 344 535]]
Expected table :
[[ 2225.05944584 1530.94055416]
[11365.2197733 7819.7802267 ]
[ 520.72078086 358.27921914]]
The p value returned = 0.0 and degrees of freedom returned = 2
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs MailOpened Chi Square Test of Independence -------------------
Contingency table :
[[ 495 3261]
[9403 9782]
[ 173 706]]
Expected table :
[[ 1588.02166247 2167.97833753]
[ 8111.3406801 11073.6593199 ]
[ 371.63765743 507.36234257]]
The p value returned = 0.0 and degrees of freedom returned = 2
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs MailClicked Chi Square Test of Independence -------------------
Contingency table :
[[ 2897 786 73]
[ 6277 12138 770]
[ 535 267 77]]
Expected table :
[[ 1530.94055416 2079.99143577 145.06801008]
[ 7819.7802267 10624.23740554 740.98236776]
[ 358.27921914 486.77115869 33.94962217]]
The p value returned = 0.0 and degrees of freedom returned = 4
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs days_since_first_transaction_category Chi Square Test of Independence -------------------
Contingency table :
[[ 188 429 3033 106]
[ 3399 3331 11880 575]
[ 524 229 100 26]]
Expected table :
[[ 648.23324937 628.99596977 2367.28916877 111.48161209]
[ 3311.06360202 3212.80289673 12091.70465995 569.42884131]
[ 151.70314861 147.2011335 554.00617128 26.0895466 ]]
The p value returned = 0.0 and degrees of freedom returned = 6
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs email_duration_category Chi Square Test of Independence -------------------
Contingency table :
[[ 192 3116 354 94]
[1669 7251 5569 4696]
[ 19 587 103 170]]
Expected table :
[[ 296.44332494 1727.25541562 950.19546599 782.10579345]
[1514.1813602 8822.52267003 4853.43450882 3994.86146096]
[ 69.37531486 404.22191436 222.37002519 183.03274559]]
The p value returned = 0.0 and degrees of freedom returned = 6
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs customer_segment Chi Square Test of Independence -------------------
Contingency table :
[[1791 1965]
[9916 9269]
[ 637 242]]
Expected table :
[[1946.43425693 1809.56574307]
[9942.05037783 9242.94962217]
[ 455.51536524 423.48463476]]
The p value returned = 6.094259288968507e-39 and degrees of freedom returned = 2
significance(alpha) = 0.050
Dependent (reject H0)
---------------RetentionPotential Vs Cluster Name Chi Square Test of Independence -------------------
Contingency table :
[[1082 1562 558 486 68]
[5477 8004 2797 2620 287]
[ 242 384 131 114 8]]
Expected table :
[[1072.39949622 1568.94206549 549.68161209 507.73803526 57.23879093]
[5477.63161209 8013.88539043 2807.67884131 2593.43828715 292.36586902]
[ 250.96889169 367.17254408 128.6395466 118.82367758 13.39534005]]
The p value returned = 0.5233392144502882 and degrees of freedom returned = 8
significance(alpha) = 0.050
Independent (fail to reject H0)
---------------RetentionPotential Vs most_ordered_quarter Chi Square Test of Independence -------------------
Contingency table :
[[ 547 1048 2161]
[ 3289 3703 12193]
[ 208 254 417]]
Expected table :
[[ 637.66851385 789.20151134 2329.12997481]
[ 3257.10075567 4031.10516373 11896.7940806 ]
[ 149.23073048 184.69332494 545.07594458]]
The p value returned = 3.2547730739300043e-47 and degrees of freedom returned = 4
significance(alpha) = 0.050
Dependent (reject H0)
From the chi-square test analysis we conclude that the Cluster Name column should be removed: the test failed to reject H0, so it is independent of RetentionPotential. (CustomerID is likewise independent, as expected for a unique identifier.)
#Cluster Name should be removed as it adds no benefit to our analysis
train=train.drop(['Cluster Name'],axis=1)
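One caveat on the tests above: with ~24k rows, chi-square p-values flag even weak associations as "dependent". An effect-size measure such as Cramér's V could rank the dependent columns by strength; a sketch (the helper and the toy table are illustrative, not from the notebook):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1))); 0 = none, 1 = perfect."""
    chi2 = chi2_contingency(table)[0]
    n = table.sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

# Illustrative 2x2 table with a strong association.
tab = np.array([[90, 10], [10, 90]])
print(round(cramers_v(tab), 3))  # -> 0.79, a strong association
```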
FEATURE ENGINEERING
train['Registration_month']=train['DateOfRegistration'].dt.month
train['Registration_weekday']=train['DateOfRegistration'].dt.weekday
train['lastemail_month']=train['lastemailDatetime'].dt.month
train['lastemail_weekday']=train['lastemailDatetime'].dt.weekday
train['transaction_month']=train['last_transaction_date'].dt.month
train['transaction_weekday']=train['last_transaction_date'].dt.weekday
train=train.drop(["last_transaction_date","lastemailDatetime","DateOfRegistration"],axis=1)
datetypecolumns
train=train.drop(['firstemailDatetime',"first_transaction_date","emailDatetime"],axis=1)
for col in ['Registration_month',
'Registration_weekday', 'lastemail_month', 'lastemail_weekday',
'transaction_month', 'transaction_weekday']:
train[col] =train[col].astype('category')
barplot('Registration_month',train)
barplot('Registration_weekday',train)
barplot('lastemail_month',train)
barplot('lastemail_weekday',train)
barplot('transaction_month',train)
barplot('transaction_weekday',train)
barplot('RegistrationYear',train)
Highlights from the plots above:
1)When fewer mails are sent, fewer people engage in transactions, so engagement with customers is crucial.
2)Customer acquisition was highest in 2011, 2013, and 2017.
3)There are more transactions on weekends than on weekdays, with Thursday and Friday the lowest.
4)Monthly, transactions grow sharply from the start of winter until the end of January, so engaging customers during this period is crucial; registrations also peak in this period.
5)Registrations and transactions were lowest during the summer period.
train["RegistrationYear"]=train["RegistrationYear"].astype("category")
#feature elimination for numerical columns
train=train.drop(['Totalorderquantity_in_first_quarter',
'Totalorderquantity_in_second_quarter',
'Totalorderquantity_in_Third_quarter','minimumorderquantity', 'maximumorderquantity','averageorders'],axis=1)
num_cols=train.select_dtypes(include=['int64', 'float64'])
plt.figure(figsize=(10,12))
sns.heatmap(num_cols.corr(),annot=True)
Looking at the correlation matrix, total mails not opened, total mails opened, and times clicked yes/no all correlate strongly with the number of mails. Instead of keeping these redundant counts, let's combine them into derived rate features.
train['open_rate'] = np.log(train['Totalmails-oponed'] / train['numofemails'])
train['click_through_rate'] = np.log(train['totalorders'] / train['Totalmails-oponed'])
train['response_rate'] = np.log(train['totalorders'] / train['numofemails'])
train['open_rate_ratio'] =np.log(train['Totalmails-oponed'] / train['Totalmails-notoponed'])
train['click_through_rate_ratio'] = np.log(train['timesclicked-yes'] / train['timesclicked-no'])
train['response_rate_ratio'] = np.log(train['totalorders']) / (train['numofemails'] - train['totalorders'])  # note: log is applied to the numerator only here; np.log(totalorders / (numofemails - totalorders)) may have been intended
train[['open_rate',
'click_through_rate', 'response_rate', 'open_rate_ratio',
'click_through_rate_ratio', 'response_rate_ratio']].describe()
| open_rate | click_through_rate | response_rate | open_rate_ratio | click_through_rate_ratio | response_rate_ratio | |
|---|---|---|---|---|---|---|
| count | 23820.000000 | 2.382000e+04 | 2.382000e+04 | 2.382000e+04 | 1.411100e+04 | 2.377600e+04 |
| mean | -0.962155 | -inf | -inf | inf | NaN | NaN |
| std | 1.184260 | NaN | NaN | NaN | NaN | NaN |
| min | -4.290459 | -inf | -inf | -4.290459e+00 | -inf | -inf |
| 25% | -1.824549 | 2.708050e+00 | 1.082514e+00 | -1.686399e+00 | 2.335375e+00 | -9.692826e-02 |
| 50% | -0.342945 | 3.713572e+00 | 2.819948e+00 | 6.931472e-01 | 3.555348e+00 | -7.635177e-02 |
| 75% | 0.000000 | 4.094345e+00 | 3.713572e+00 | NaN | NaN | -2.853937e-02 |
| max | 0.693147 | 1.005410e+01 | 9.879758e+00 | inf | inf | inf |
# Replace inf and NaN values in selected columns
cols = ['open_rate_ratio', 'click_through_rate_ratio', 'response_rate_ratio','click_through_rate','open_rate','response_rate']
replace_value = 0 # Replace with NaN, or any other value you want
train[cols] = train[cols].replace([np.inf, -np.inf], 0)
train[cols] = train[cols].fillna(replace_value)
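The inf/-inf/NaN values in the `describe()` output come from `np.log` hitting zero numerators or denominators (log of 0 is -inf; 0/0 is NaN). An alternative to replacing them afterwards is additive smoothing before the log, which keeps every ratio finite. A sketch (the column names and the smoothing constant `eps` are illustrative):

```python
import numpy as np
import pandas as pd

def safe_log_ratio(numerator: pd.Series, denominator: pd.Series, eps: float = 1.0) -> pd.Series:
    """Log of a ratio with additive smoothing, so zero counts never produce inf or NaN."""
    return np.log((numerator + eps) / (denominator + eps))

# Toy counts including the problematic zero cases.
demo = pd.DataFrame({"opened": [0, 5, 10], "sent": [10, 10, 0]})
ratio = safe_log_ratio(demo["opened"], demo["sent"])
```

With smoothing, a customer who opened none of 10 mails gets log(1/11) rather than -inf, preserving the "very low open rate" signal instead of collapsing it to the replacement value.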
train=train.drop(['Totalmails-notoponed','Totalmails-oponed', 'timesclicked-yes', 'timesclicked-no'],axis=1)
# let's move on to modelling
train = train.rename(columns={'totalorders': 'sumoforderquantity'})
1)Monthly orders, yearly orders, monthly mails, and yearly mails all peaked during 2013-2014.
2)Orders exceeded 200,000 in January, September, October, November, and December, and stayed below 200,000 in the other months.
3)City 1 has far fewer customers than cities 2, 3, and 4; from this we can draw a few conclusions, which may or may not hold.
4)City 2 has the highest number of customers with medium retention potential; we should not lose these customers, as they are valuable for our analysis.
5)City 4 has 7,359 customers with low retention potential. Their churn risk is much higher than in other cities, so a reassessment should take place to keep them from churning.
6)The average order value in city 1 is higher than in city 3, even though city 1 has fewer customers, and city 3 has a lower order quantity during the second quarter of the year.
7)Until 2013, customer registrations grew exponentially; after 2013 registrations dropped suddenly, rose again in 2016, and then dropped. We need to find out why registrations fell after 2013. Our earlier transaction and mail visualizations also showed strong activity during 2013-2014 followed by a drop, so the fall in registrations may be one of the potential reasons.
8)Of the customers who received only one mail, 880 never opened it. Those who received only one mail got welcome mails, so we can infer that those 880 customers were not interested.
9)Customers with a long-term history are less likely to prefer online communication.
10)Customers with a moderate or long-term history include a number with medium retention potential. This is useful for our analysis, since our focus is on medium-retention customers: if we concentrate on them we can improve the business, and as they have been customers for a long time we should plan something that keeps them loyal.
1)Online communication: the medium retention-potential group has a lower percentage of customers preferring online communication.
2)Automatic refill: the high retention-potential group did not prefer automatic refill at all.
3)Doorstep delivery: preference is about the same across all target groups; most customers want doorstep delivery.
4)Mail opened: the low retention-potential group has a comparatively higher percentage of customers who do not open mails.
5)Mail clicked: the medium retention-potential group clicked mails more than the others.
6)days_since_first_transaction category: a) only transaction: customers with high retention chances dominate; b) moderate history: customers with medium retention chances dominate; c) long-term history: customers with medium retention chances dominate; d) recent transaction: roughly equal across all retention-potential categories.
7)email_duration_category: a) only email: the high retention-potential group has the most customers with only one email sent; b) long mail history: the low retention-potential group has the most; c) very long mail history: the low retention-potential group has the most; d) medium history: roughly the same across all groups.
8)Customer segment: among medium retention-potential customers, large-order customers outnumber small-order customers.
9)Comparing order quantities by registrations in two-year bins, 2010 had the highest ordered quantity for medium- and high-retention customers. Interestingly, the customers most likely to churn have the highest order-quantity rate in 2018.
10)There are 355 medium retention-potential customers with a long-term transaction history and only one mail sent to them. Medium-retention customers with a long-term transaction history are comparatively numerous, so we can target these customers and prevent them from churning.
11)740 customers preferred neither doorstep delivery nor automatic refill. There may be underlying factors here, so the company should look at improving these services in the near future.
# split into input (X) and output (y) variables
X=train.drop(['CustomerID','RetentionPotential','days_since_first_transaction_category', 'email_duration_category'], axis=1)
y=train["RetentionPotential"]
#label encoding of target variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y=le.fit_transform(y)
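`LabelEncoder` assigns integer codes in alphabetical order of the class labels, which explains why the reports below index the classes as 0, 1, 2. A quick self-contained sketch (the label strings are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(["medium", "low", "high", "medium"])

# Classes are sorted alphabetically: high -> 0, low -> 1, medium -> 2
mapping = dict(zip(le_demo.classes_, range(len(le_demo.classes_))))
print(mapping)
print(le_demo.inverse_transform(codes))  # recovers the original string labels
```

Keeping `le` around (as the cell above does) lets us call `inverse_transform` later to map predicted codes back to the retention-potential labels.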
cat_cols=train.select_dtypes(include=['category'])
cat_cols=cat_cols.drop(["CustomerID","RetentionPotential",'days_since_first_transaction_category', 'email_duration_category'],axis=1)
cat_cols=cat_cols.columns
num_cols=train.select_dtypes(include=['int64', 'float64'])
num_cols=num_cols.drop(['open_rate','click_through_rate', 'response_rate', 'open_rate_ratio','click_through_rate_ratio', 'response_rate_ratio'],axis=1)
num_cols=num_cols.columns
## Convert Categorical Columns to Dummies
X = pd.get_dummies(X, columns=cat_cols,drop_first=True)
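One caveat with `pd.get_dummies`: it is fit-free, so a held-out or future dataset with different category levels produces a different dummy column set. Reindexing against the training columns is a common safeguard; a sketch on toy frames (the frame and column names are illustrative):

```python
import pandas as pd

train_demo = pd.DataFrame({"city": ["CITY1", "CITY2", "CITY3"]})
test_demo = pd.DataFrame({"city": ["CITY2", "CITY4"]})  # CITY4 unseen in training

X_tr = pd.get_dummies(train_demo, columns=["city"], drop_first=True)
X_te = pd.get_dummies(test_demo, columns=["city"], drop_first=True)

# Align test to the training columns: drop unseen dummies, add missing ones as 0.
X_te = X_te.reindex(columns=X_tr.columns, fill_value=0)
```

Since we dummy-encode before the train/test split here, both splits share the same columns automatically; alignment only matters when scoring genuinely new data.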
#scaling
scaler = StandardScaler()
# Note: the scaler is fitted on all of X before the split; fitting on the training split only would avoid test-set leakage
X[num_cols] =scaler.fit_transform(X[num_cols])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, stratify = y,test_size=0.25,random_state=107,shuffle=True)
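Because the scaler above is fitted before `train_test_split`, the test split's statistics leak into the scaling. The standard leakage-free pattern is to fit on the training split and reuse those statistics on the test split; a minimal sketch on synthetic data (array shapes and seeds are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

X_tr, X_te = train_test_split(X_demo, test_size=0.25, random_state=0)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # estimate mean/std on train only
X_te_scaled = scaler_demo.transform(X_te)      # reuse the same statistics
```

The training columns come out exactly standardized; the test columns only approximately, which is the honest picture of how the model will see unseen data.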
print('X_Train Size:', (X_train.shape))
print('X_test Size:', (X_test.shape))
print('Y_train Size:', (y_train.shape))
print('Y_test Size:', (y_test.shape))
X_Train Size: (17865, 97) X_test Size: (5955, 97) Y_train Size: (17865,) Y_test Size: (5955,)
# Function for Classification Report
def classifcation_report_train_test(y_train, y_train_pred, y_test, y_test_pred):
print('''
*****************************************
CLASSIFICATION REPORT FOR TRAIN DATA
*****************************************
''')
print(classification_report(y_train, y_train_pred, digits=4))
print('''
*****************************************
CLASSIFICATION REPORT FOR TEST DATA
*****************************************
''')
print(classification_report(y_test, y_test_pred, digits=4))
#defining error metrics
scores = pd.DataFrame(columns=['Model','Train_Accuracy','Train_Recall','Test_Accuracy','Test_Recall'])
def get_metrics(train_actual,train_predicted,test_actual,test_predicted,model_description,dataframe):
train_accuracy = accuracy_score(train_actual,train_predicted)
train_recall = recall_score(train_actual,train_predicted,average="macro")
test_accuracy = accuracy_score(test_actual,test_predicted)
test_recall = recall_score(test_actual,test_predicted,average="macro")
new_row = pd.DataFrame([[model_description, train_accuracy, train_recall,
test_accuracy, test_recall]], columns=scores.columns)
dataframe = pd.concat([dataframe, new_row], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
return dataframe
We will experiment with different models until we find one that generalizes well:
1)Naive Bayes classifier
2)Decision tree classifier
3)Random forest classifier
4)Support vector machine classifier
5)XGBoost
6)Boosting techniques
7)Neural networks
8)KNN classifier
We will then choose the best model of all and examine the learning patterns for each of them.
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import recall_score
from sklearn import svm
NB1 = GaussianNB()
NB1=NB1.fit(X_train, y_train)
y_pred_train =NB1.predict(X_train)
y_pred_train[0:10]
# Make predictions on the testing set and evaluate accuracy
y_pred_test = NB1.predict(X_test)
y_pred_test[0:10]
print("trainaccuracy",accuracy_score(y_train,y_pred_train))
print("testaccuracy",accuracy_score(y_test,y_pred_test))
print("recall-train",recall_score(y_train,y_pred_train,average='macro'))
print("recall-test",recall_score(y_test,y_pred_test,average='macro'))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"GaussianNB",scores)
scores
trainaccuracy 0.6920794850265883
testaccuracy 0.6886649874055416
recall-train 0.7184312500115704
recall-test 0.7101471229574522
*****************************************
CLASSIFICATION REPORT FOR TRAIN DATA
*****************************************
precision recall f1-score support
0 0.3585 0.8931 0.5116 2817
1 0.9657 0.6567 0.7817 14389
2 0.3761 0.6055 0.4640 659
accuracy 0.6921 17865
macro avg 0.5667 0.7184 0.5858 17865
weighted avg 0.8482 0.6921 0.7274 17865
*****************************************
CLASSIFICATION REPORT FOR TEST DATA
*****************************************
precision recall f1-score support
0 0.3561 0.9010 0.5104 939
1 0.9628 0.6522 0.7776 4796
2 0.3848 0.5773 0.4618 220
accuracy 0.6887 5955
macro avg 0.5679 0.7101 0.5833 5955
weighted avg 0.8457 0.6887 0.7238 5955
| Model | Train_Accuracy | Train_Recall | Test_Accuracy | Test_Recall | |
|---|---|---|---|---|---|
| 0 | GaussianNB | 0.692079 | 0.718431 | 0.688665 | 0.710147 |
from sklearn.model_selection import learning_curve
from sklearn.model_selection import StratifiedKFold
# Define the stratified k-fold cross-validator
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_sizes, train_scores, test_scores = learning_curve(NB1, X_train, y_train, cv=skf, scoring='accuracy', n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10))
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.figure(figsize=(8,6))
plt.grid()
plt.title("Learning Curve")
plt.xlabel("Training Examples")
plt.ylabel("Score")
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
plt.legend(loc="best")
plt.show()
From the learning curve, the Naive Bayes classifier appears to underfit. One possible reason is limited training data, but the classification report also shows low recall for the target labels. Since we are not getting the desired performance here, we will move on and check other models.
clf_dt = DecisionTreeClassifier()
clf_dt.fit(X=X_train, y= y_train)
y_pred_train = clf_dt.predict(X_train)
y_pred_test = clf_dt.predict(X_test)
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"Decisiontree",scores)
scores
Train Accuracy: 0.9876294430450602
Test Accuracy: 0.8777497900923593
*****************************************
CLASSIFICATION REPORT FOR TRAIN DATA
*****************************************
precision recall f1-score support
0 0.9493 0.9769 0.9629 2817
1 0.9949 0.9897 0.9923 14389
2 0.9985 0.9879 0.9931 659
accuracy 0.9876 17865
macro avg 0.9809 0.9848 0.9828 17865
weighted avg 0.9878 0.9876 0.9877 17865
*****************************************
CLASSIFICATION REPORT FOR TEST DATA
*****************************************
precision recall f1-score support
0 0.6302 0.6880 0.6578 939
1 0.9326 0.9206 0.9265 4796
2 0.8469 0.7545 0.7981 220
accuracy 0.8777 5955
macro avg 0.8033 0.7877 0.7942 5955
weighted avg 0.8818 0.8777 0.8794 5955
| Model | Train_Accuracy | Train_Recall | Test_Accuracy | Test_Recall | |
|---|---|---|---|---|---|
| 0 | GaussianNB | 0.692079 | 0.718431 | 0.688665 | 0.710147 |
| 1 | Decisiontree | 0.987629 | 0.984834 | 0.877750 | 0.787690 |
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores = learning_curve(clf_dt, X_train, y_train, cv=skf, scoring='accuracy', n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10))
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.figure(figsize=(8,6))
plt.grid()
plt.title("Learning Curve")
plt.xlabel("Training Examples")
plt.ylabel("Score")
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
plt.legend(loc="best")
plt.show()
The learning curve shows that the model is overfitting. A good model should generalize well to instances that were not part of the training data, but here there is a large gap between the training and validation scores. To address this, we can either add more data or perform hyperparameter tuning.
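One way to rein in an overfitting decision tree is to constrain its growth and pick the constraints by cross-validation. A minimal `GridSearchCV` sketch on synthetic data (the grid values and dataset are illustrative, not tuned for this problem):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

param_grid = {
    "max_depth": [3, 5, 10, None],   # cap tree depth
    "min_samples_leaf": [1, 5, 20],  # require more samples per leaf
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 4))
```

Shallower trees and larger leaves trade some training accuracy for a smaller train-validation gap, which is exactly the failure mode the learning curve exposed.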
# Import Randomforest Library
from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier()
clf_rf.fit(X=X_train, y= y_train)
y_pred_train = clf_rf.predict(X_train)
y_pred_test = clf_rf.predict(X_test)
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"randomforest",scores)
scores
Train Accuracy: 0.9875734676742234
Test Accuracy: 0.9079764903442485
*****************************************
CLASSIFICATION REPORT FOR TRAIN DATA
*****************************************
precision recall f1-score support
0 0.9780 0.9457 0.9616 2817
1 0.9889 0.9958 0.9923 14389
2 0.9985 0.9879 0.9931 659
accuracy 0.9876 17865
macro avg 0.9884 0.9764 0.9823 17865
weighted avg 0.9875 0.9876 0.9875 17865
*****************************************
CLASSIFICATION REPORT FOR TEST DATA
*****************************************
precision recall f1-score support
0 0.8099 0.6124 0.6974 939
1 0.9203 0.9750 0.9468 4796
2 0.9512 0.7091 0.8125 220
accuracy 0.9080 5955
macro avg 0.8938 0.7655 0.8189 5955
weighted avg 0.9040 0.9080 0.9025 5955
| Model | Train_Accuracy | Train_Recall | Test_Accuracy | Test_Recall | |
|---|---|---|---|---|---|
| 0 | GaussianNB | 0.692079 | 0.718431 | 0.688665 | 0.710147 |
| 1 | Decisiontree | 0.987629 | 0.984834 | 0.877750 | 0.787690 |
| 2 | randomforest | 0.987573 | 0.976436 | 0.907976 | 0.765475 |
This is also an example of overfitting, as there is a large gap between the training and validation scores.
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores = learning_curve(clf_rf, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10))
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.figure(figsize=(8,6))
plt.grid()
plt.title("Learning Curve")
plt.xlabel("Training Examples")
plt.ylabel("Score")
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
plt.legend(loc="best")
plt.show()
As with the decision tree, the learning curve shows overfitting: a large gap between the training and validation scores. Again, we can either add more data or tune hyperparameters to improve generalization.
importances = clf_rf.feature_importances_
importances
np.argsort(importances)[::-1]
indices = np.argsort(importances)[::-1]
ind_attr_names = X_train.columns
pd.DataFrame([ind_attr_names[indices], np.sort(importances)[::-1]])
Top features by importance (the remaining 82 of the 97 features all score 0.021 or less, tapering down to 0.000543 for RegistrationYear_2018):

| Feature | Importance |
|---|---|
| recency | 0.0805 |
| Time_since_Last_Email | 0.0746 |
| numofemails | 0.0720 |
| response_rate | 0.0642 |
| Time_to_First_Email | 0.0488 |
| open_rate | 0.0460 |
| duration_of_customer_relation | 0.0403 |
| click_through_rate | 0.0394 |
| sumoforderquantity | 0.0384 |
| response_rate_ratio | 0.0379 |
| open_rate_ratio | 0.0275 |
| EmailType_WelcomeEmail | 0.0258 |
| OrderQuantity | 0.0249 |
| OnlineCommunication_yes | 0.0232 |
| num_transactions | 0.0212 |
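Feature importances are easier to inspect when the names and values are paired in a sorted Series rather than in a wide transposed frame. A sketch on a toy model (the dataset and feature names are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=7)
cols = [f"feat_{i}" for i in range(5)]

rf = RandomForestClassifier(random_state=7).fit(X_demo, y_demo)
importances_demo = (pd.Series(rf.feature_importances_, index=cols)
                      .sort_values(ascending=False))
print(importances_demo)
```

The same pattern applied to `clf_rf` with `X_train.columns` would reproduce the ranking above without the manual `np.argsort` bookkeeping.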
from sklearn.svm import SVC
clf_svc = SVC()
clf_svc.fit(X=X_train, y= y_train)
y_pred_train = clf_svc.predict(X_train)
y_pred_test = clf_svc.predict(X_test)
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"SVMCLASSIFIER",scores)
scores
Train Accuracy: 0.8963336132101876
Test Accuracy: 0.8960537363560034
*****************************************
CLASSIFICATION REPORT FOR TRAIN DATA
*****************************************
precision recall f1-score support
0 0.7425 0.5815 0.6522 2817
1 0.9171 0.9681 0.9419 14389
2 0.9488 0.6753 0.7890 659
accuracy 0.8963 17865
macro avg 0.8695 0.7416 0.7944 17865
weighted avg 0.8907 0.8963 0.8906 17865
*****************************************
CLASSIFICATION REPORT FOR TEST DATA
*****************************************
precision recall f1-score support
0 0.7552 0.5783 0.6550 939
1 0.9146 0.9693 0.9412 4796
2 0.9412 0.6545 0.7721 220
accuracy 0.8961 5955
macro avg 0.8703 0.7341 0.7894 5955
weighted avg 0.8905 0.8961 0.8898 5955
| Model | Train_Accuracy | Train_Recall | Test_Accuracy | Test_Recall | |
|---|---|---|---|---|---|
| 0 | GaussianNB | 0.692079 | 0.718431 | 0.688665 | 0.710147 |
| 1 | Decisiontree | 0.987629 | 0.984834 | 0.877750 | 0.787690 |
| 2 | randomforest | 0.987573 | 0.976436 | 0.907976 | 0.765475 |
| 3 | SVMCLASSIFIER | 0.896334 | 0.741612 | 0.896054 | 0.734057 |
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores = learning_curve(clf_svc, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10))
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.figure(figsize=(8,6))
plt.grid()
plt.title("Learning Curve")
plt.xlabel("Training Examples")
plt.ylabel("Score")
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
plt.legend(loc="best")
plt.show()
This qualifies as a generalized model; we will analyze it further with learning curves during hyperparameter tuning. The drawback of SVM is its long training time, so we will first check alternative models and fall back to the SVM classifier only if nothing else works better.
# Import XGBClassifier Library
from xgboost import XGBClassifier
clf_XGB = XGBClassifier()
clf_XGB.fit(X=X_train, y= y_train)
y_pred_train = clf_XGB.predict(X_train)
y_pred_test = clf_XGB.predict(X_test)
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"XGBOOST",scores)
scores
Train Accuracy: 0.9540442205429611
Test Accuracy: 0.9182199832073887
*****************************************
CLASSIFICATION REPORT FOR TRAIN DATA
*****************************************
precision recall f1-score support
0 0.9415 0.7654 0.8443 2817
1 0.9540 0.9907 0.9720 14389
2 1.0000 0.9605 0.9799 659
accuracy 0.9540 17865
macro avg 0.9652 0.9055 0.9321 17865
weighted avg 0.9537 0.9540 0.9522 17865
*****************************************
CLASSIFICATION REPORT FOR TEST DATA
*****************************************
precision recall f1-score support
0 0.8375 0.6475 0.7303 939
1 0.9286 0.9756 0.9515 4796
2 0.9526 0.8227 0.8829 220
accuracy 0.9182 5955
macro avg 0.9062 0.8153 0.8549 5955
weighted avg 0.9151 0.9182 0.9141 5955
| Model | Train_Accuracy | Train_Recall | Test_Accuracy | Test_Recall | |
|---|---|---|---|---|---|
| 0 | GaussianNB | 0.692079 | 0.718431 | 0.688665 | 0.710147 |
| 1 | Decisiontree | 0.987629 | 0.984834 | 0.877750 | 0.787690 |
| 2 | randomforest | 0.987573 | 0.976436 | 0.907976 | 0.765475 |
| 3 | SVMCLASSIFIER | 0.896334 | 0.741612 | 0.896054 | 0.734057 |
| 4 | XGBOOST | 0.954044 | 0.905529 | 0.918220 | 0.815276 |
The XGBoost classifier is somewhat overfitting, which means it might not perform as well on test data or unseen instances. We will evaluate it further before deciding whether to use it.
ADABOOST CLASSIFIER
clf_ada = AdaBoostClassifier(estimator=GaussianNB(), random_state=123)  # use base_estimator= on scikit-learn < 1.2
clf_ada.fit(X_train, y_train)
y_pred_train = clf_ada.predict(X_train)
y_pred_test = clf_ada.predict(X_test)
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"AdaBoostClassifier",scores)
scores
Train Accuracy: 0.7385950181919955
Test Accuracy: 0.7390428211586901
*****************************************
CLASSIFICATION REPORT FOR TRAIN DATA
*****************************************
precision recall f1-score support
0 0.3239 0.3039 0.3136 2817
1 0.8428 0.8456 0.8442 14389
2 0.2188 0.2610 0.2381 659
accuracy 0.7386 17865
macro avg 0.4618 0.4701 0.4653 17865
weighted avg 0.7380 0.7386 0.7382 17865
*****************************************
CLASSIFICATION REPORT FOR TEST DATA
*****************************************
precision recall f1-score support
0 0.2953 0.2673 0.2806 939
1 0.8395 0.8543 0.8468 4796
2 0.2356 0.2409 0.2382 220
accuracy 0.7390 5955
macro avg 0.4568 0.4542 0.4552 5955
weighted avg 0.7314 0.7390 0.7351 5955
| Model | Train_Accuracy | Train_Recall | Test_Accuracy | Test_Recall | |
|---|---|---|---|---|---|
| 0 | GaussianNB | 0.692079 | 0.718431 | 0.688665 | 0.710147 |
| 1 | Decisiontree | 0.987629 | 0.984834 | 0.877750 | 0.787690 |
| 2 | randomforest | 0.987573 | 0.976436 | 0.907976 | 0.765475 |
| 3 | SVMCLASSIFIER | 0.896334 | 0.741612 | 0.896054 | 0.734057 |
| 4 | XGBOOST | 0.954044 | 0.905529 | 0.918220 | 0.815276 |
| 5 | AdaBoostClassifier | 0.738595 | 0.470149 | 0.739043 | 0.454156 |
An underfitting model, with low recall for target '2' on both the train and validation sets.
clf_GBM = GradientBoostingClassifier()
clf_GBM.fit(X=X_train, y=y_train)
y_pred_train = clf_GBM.predict(X_train)
y_pred_test = clf_GBM.predict(X_test)
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"GBM",scores)
Train Accuracy: 0.9196193674783095
Test Accuracy: 0.911167086481948
*****************************************
CLASSIFICATION REPORT FOR TRAIN DATA
*****************************************
precision recall f1-score support
0 0.8710 0.6134 0.7199 2817
1 0.9241 0.9836 0.9529 14389
2 0.9682 0.8316 0.8947 659
accuracy 0.9196 17865
macro avg 0.9211 0.8095 0.8558 17865
weighted avg 0.9174 0.9196 0.9140 17865
*****************************************
CLASSIFICATION REPORT FOR TEST DATA
*****************************************
precision recall f1-score support
0 0.8480 0.5942 0.6988 939
1 0.9185 0.9798 0.9481 4796
2 0.9337 0.7682 0.8429 220
accuracy 0.9112 5955
macro avg 0.9001 0.7807 0.8299 5955
weighted avg 0.9079 0.9112 0.9049 5955
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores = learning_curve(clf_GBM, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10))
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.figure(figsize=(8,6))
plt.grid()
plt.title("Learning Curve")
plt.xlabel("Training Examples")
plt.ylabel("Score")
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
plt.legend(loc="best")
plt.show()
If the gap between the training and validation performance in the learning curve is small, it generally indicates that the model is generalizing well and is not overfitting on the training data. This means that the model is learning patterns that are useful for making predictions on new, unseen data.
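The same check can be read directly off the `learning_curve` arrays. A minimal sketch with synthetic scores (the 0.05 threshold here is our own assumption, not a universal rule):

```python
import numpy as np

# Synthetic mean scores standing in for the learning-curve output above.
train_scores_mean = np.array([0.95, 0.93, 0.92, 0.92])
test_scores_mean = np.array([0.85, 0.88, 0.90, 0.91])

# Gap between the two curves at the largest training size.
gap = float(train_scores_mean[-1] - test_scores_mean[-1])
is_generalizing = gap < 0.05  # assumed threshold for "small gap"
print(f"final train-CV gap: {gap:.3f}, generalizing: {is_generalizing}")
```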
from sklearn.neighbors import KNeighborsClassifier
# Create a KNN classifier with k=5
clf_knn = KNeighborsClassifier(n_neighbors=5)
# Fit the classifier on the training data
clf_knn.fit(X_train, y_train)
# Make predictions on the training and test data
y_pred_train = clf_knn.predict(X_train)
y_pred_test = clf_knn.predict(X_test)
# Print the accuracy scores for the training and test data
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# Print the classification report for the training and test data
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
# Get the metrics for the classifier and add them to the scores dictionary
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"KNN",scores)
Train Accuracy: 0.9014273719563393
Test Accuracy: 0.8723761544920235
*****************************************
CLASSIFICATION REPORT FOR TRAIN DATA
*****************************************
precision recall f1-score support
0 0.7932 0.5815 0.6710 2817
1 0.9146 0.9783 0.9454 14389
2 0.9511 0.5903 0.7285 659
accuracy 0.9014 17865
macro avg 0.8863 0.7167 0.7816 17865
weighted avg 0.8968 0.9014 0.8941 17865
*****************************************
CLASSIFICATION REPORT FOR TEST DATA
*****************************************
precision recall f1-score support
0 0.6958 0.4579 0.5523 939
1 0.8919 0.9683 0.9285 4796
2 0.9308 0.5500 0.6914 220
accuracy 0.8724 5955
macro avg 0.8395 0.6587 0.7241 5955
weighted avg 0.8624 0.8724 0.8604 5955
The model has a higher accuracy on the training set (90.14%) than on the test set (87.24%), which suggests that it may be overfitting to the training data. The recall values for classes 0 and 2 are relatively low, especially on the test set, which suggests that the model struggles to correctly identify instances of these classes.
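One common remedy for an overfitting KNN is a larger `n_neighbors`, which smooths the decision boundary at the cost of some training accuracy. A hedged sketch on synthetic data (not our customer dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem: k=1 memorizes the training set
# (train accuracy 1.0), while a larger k usually shrinks the
# train/test gap.
X_demo, y_demo = make_classification(n_samples=1500, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

gaps = {}
for k in (1, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr)
    gaps[k] = knn.score(Xtr, ytr) - knn.score(Xte, yte)
```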
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
# Define the model architecture
model = Sequential([
Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
Dropout(0.2),
Dense(32, activation='relu'),
Dropout(0.2),
Dense(3, activation='softmax')
])
# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))
# Get the predicted class probabilities for the train and test data
y_pred_train_prob = model.predict(X_train)
y_pred_test_prob = model.predict(X_test)
# Get the predicted classes by selecting the class with the highest probability
y_pred_train = np.argmax(y_pred_train_prob, axis=1)
y_pred_test = np.argmax(y_pred_test_prob, axis=1)
# Print the accuracy scores for the training and test data
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# Print the classification report for the training and test data
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
# Get the metrics for the classifier and add them to the scores dictionary
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"ANN-CLASSIFICATION",scores)
Epoch 1/10
559/559 [==============================] - 4s 5ms/step - loss: 0.4211 - accuracy: 0.8355 - val_loss: 0.3055 - val_accuracy: 0.8806
Epoch 2/10
559/559 [==============================] - 2s 4ms/step - loss: 0.3163 - accuracy: 0.8704 - val_loss: 0.2765 - val_accuracy: 0.8877
Epoch 3/10
559/559 [==============================] - 2s 4ms/step - loss: 0.2915 - accuracy: 0.8818 - val_loss: 0.2621 - val_accuracy: 0.8969
Epoch 4/10
559/559 [==============================] - 4s 7ms/step - loss: 0.2736 - accuracy: 0.8897 - val_loss: 0.2537 - val_accuracy: 0.8969
Epoch 5/10
559/559 [==============================] - 3s 4ms/step - loss: 0.2611 - accuracy: 0.8919 - val_loss: 0.2513 - val_accuracy: 0.8959
Epoch 6/10
559/559 [==============================] - 2s 3ms/step - loss: 0.2552 - accuracy: 0.8943 - val_loss: 0.2549 - val_accuracy: 0.8954
Epoch 7/10
559/559 [==============================] - 1s 3ms/step - loss: 0.2498 - accuracy: 0.8966 - val_loss: 0.2439 - val_accuracy: 0.9003
Epoch 8/10
559/559 [==============================] - 1s 3ms/step - loss: 0.2424 - accuracy: 0.8983 - val_loss: 0.2416 - val_accuracy: 0.8992
Epoch 9/10
559/559 [==============================] - 1s 2ms/step - loss: 0.2395 - accuracy: 0.9004 - val_loss: 0.2402 - val_accuracy: 0.8999
Epoch 10/10
559/559 [==============================] - 1s 3ms/step - loss: 0.2366 - accuracy: 0.9020 - val_loss: 0.2397 - val_accuracy: 0.9034
559/559 [==============================] - 1s 1ms/step
187/187 [==============================] - 0s 1ms/step
Train Accuracy: 0.9127343968653793
Test Accuracy: 0.9034424853064652
*****************************************
CLASSIFICATION REPORT FOR TRAIN DATA
*****************************************
precision recall f1-score support
0 0.8639 0.5747 0.6903 2817
1 0.9190 0.9850 0.9509 14389
2 0.9033 0.7800 0.8371 659
accuracy 0.9127 17865
macro avg 0.8954 0.7799 0.8261 17865
weighted avg 0.9097 0.9127 0.9056 17865
*****************************************
CLASSIFICATION REPORT FOR TEST DATA
*****************************************
precision recall f1-score support
0 0.8290 0.5421 0.6555 939
1 0.9129 0.9831 0.9467 4796
2 0.8864 0.7091 0.7879 220
accuracy 0.9034 5955
macro avg 0.8761 0.7448 0.7967 5955
weighted avg 0.8987 0.9034 0.8949 5955
# Plot the training and validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()
# Plot the training and validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()
Both training and test accuracies have increased, and the gap between them has also decreased, which is a good sign. Additionally, looking at the classification report, we can see that the precision, recall, and F1-scores have also improved for all three classes in the test data.
scores
| | Model | Train_Accuracy | Train_Recall | Test_Accuracy | Test_Recall |
|---|---|---|---|---|---|
| 0 | GaussianNB | 0.692079 | 0.718431 | 0.688665 | 0.710147 |
| 1 | Decisiontree | 0.987629 | 0.984834 | 0.877750 | 0.787690 |
| 2 | randomforest | 0.987573 | 0.976436 | 0.907976 | 0.765475 |
| 3 | SVMCLASSIFIER | 0.896334 | 0.741612 | 0.896054 | 0.734057 |
| 4 | XGBOOST | 0.954044 | 0.905529 | 0.918220 | 0.815276 |
| 5 | AdaBoostClassifier | 0.738595 | 0.470149 | 0.739043 | 0.454156 |
| 6 | GBM | 0.919619 | 0.809527 | 0.911167 | 0.780735 |
| 7 | KNN | 0.901427 | 0.716692 | 0.872376 | 0.658747 |
| 8 | ANN-CLASSIFICATION | 0.912734 | 0.779894 | 0.903442 | 0.744756 |
scores.to_csv('basemodels',index=False)
1) ANN-CLASSIFIER has been by far the best model: it generalizes well, its overall recall score is good, and there is still scope for improvement, which we will pursue with hyperparameter tuning.
2) DECISIONTREE, RANDOMFOREST, XGBOOST, KNN: these models are overfitting, so they may not perform as well as the others on unseen data. We will tune hyperparameters for all of them and then decide which to select.
3) NaiveBayesClassifier, AdaBoostClassifier: these models are underfitting; their performance might improve with more examples in the training set.
4) GBM: this model is learning patterns that are useful for predicting unseen data; as we can observe from its learning curve, it is a generalized model.
5) SVM CLASSIFIER: likewise a generalized model, as its learning curves show. Its only drawback is that it is computationally expensive.
For the next round of scoring we will choose the ANN classifier, GBM, SVM classifier, random forest, and XGBoost; we will build models with different hyperparameters and check which one performs best.
scores1 = pd.DataFrame(columns=['Model','Train_Accuracy','Train_Recall','Train_Precision','Train_F1_Score','Test_Accuracy','Test_Recall','Test_Precision','Test_F1_Score'])
def get_metrics(train_actual,train_predicted,test_actual,test_predicted,model_description,dataframe):
train_accuracy = accuracy_score(train_actual,train_predicted)
train_recall = recall_score(train_actual,train_predicted,average="macro")
train_precision= precision_score(train_actual,train_predicted,average="macro")
train_f1score = f1_score(train_actual,train_predicted,average="macro")
test_accuracy = accuracy_score(test_actual,test_predicted)
test_recall = recall_score(test_actual,test_predicted,average="macro")
test_precision= precision_score(test_actual,test_predicted,average="macro")
test_f1score = f1_score(test_actual,test_predicted,average="macro")
dataframe = dataframe.append(pd.Series([model_description, train_accuracy,train_recall,train_precision,train_f1score,
test_accuracy,test_recall,test_precision,test_f1score],
index=dataframe.columns), ignore_index=True)
return(dataframe)
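A toy illustration of the `average="macro"` choice used in `get_metrics`: macro recall averages the per-class recalls equally, so a weak minority class (label 2 in this made-up example) pulls the score down even though it is rare, whereas the weighted average largely hides it:

```python
from sklearn.metrics import recall_score

# Made-up labels: classes 0 and 1 fully recalled, class 2 only half.
y_true = [0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 1, 2]

macro = recall_score(y_true, y_pred, average="macro")        # (1.0 + 1.0 + 0.5) / 3
weighted = recall_score(y_true, y_pred, average="weighted")  # support-weighted: 7/8
print(round(macro, 4), round(weighted, 4))
```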
For validation we will use stratified k-fold cross-validation, a variation of k-fold cross-validation that ensures each fold is representative of the overall distribution of the target variable. This is particularly useful when the dataset is imbalanced, i.e. when some classes have significantly fewer samples than others.
For parameter tuning we will use RandomizedSearchCV, since GridSearchCV is time-consuming and computationally expensive, especially when there are many hyperparameters.
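A small synthetic sketch of why stratification matters for an imbalanced target like ours: with class counts of 15/80/5, every validation fold keeps the same 3/16/1 split:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels (counts 15 / 80 / 5, all divisible by 5).
y = np.array([0] * 15 + [1] * 80 + [2] * 5)
X = np.zeros((len(y), 1))  # dummy features; only y drives the split

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = [np.bincount(y[val_idx], minlength=3).tolist()
               for _, val_idx in skf_demo.split(X, y)]
print(fold_counts)  # five folds, each with class counts [3, 16, 1]
```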
# Define the stratified k-fold cross-validator
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
from sklearn.model_selection import RandomizedSearchCV
Let's start RandomizedSearchCV with our overfitting models first.
# After running RandomizedSearchCV, we store its results as text so that later runs are faster.
#Decision Tree with RandomizedSearchCV - Hyper-parameter Tuning
# set of parameters to test
param_grid = {"class_weight":['balanced', None],
"criterion": ["gini", "entropy"],
"max_depth": [3, 5, 6],
"min_samples_leaf": [2, 5, 10],
"max_leaf_nodes": [None, 5, 10, 20]
}
dt = DecisionTreeClassifier(random_state=123)
clf_dt_grid = RandomizedSearchCV(dt, param_grid, cv=skf, scoring='recall_macro')  # 'recall_macro' since the target has three classes
clf_dt_grid.fit(X_train, y_train)
print(clf_dt_grid.best_params_)
Output: {'min_samples_leaf': 5, 'max_leaf_nodes': 20, 'max_depth': 6, 'criterion': 'gini', 'class_weight': 'balanced'}
clf_dt_grid = DecisionTreeClassifier(min_samples_leaf=10, max_leaf_nodes=30, max_depth=8, criterion='entropy', class_weight='balanced')
clf_dt_grid.fit(X=X_train, y= y_train)
y_pred_train = clf_dt_grid.predict(X_train)
y_pred_test = clf_dt_grid.predict(X_test)
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
Train Accuracy: 0.8408060453400503
Test Accuracy: 0.8397984886649874
*****************************************
CLASSIFICATION REPORT FOR TRAIN DATA
*****************************************
precision recall f1-score support
0 0.5844 0.7568 0.6596 2817
1 0.9527 0.8527 0.9000 14389
2 0.4626 0.9393 0.6199 659
accuracy 0.8408 17865
macro avg 0.6666 0.8496 0.7265 17865
weighted avg 0.8766 0.8408 0.8517 17865
*****************************************
CLASSIFICATION REPORT FOR TEST DATA
*****************************************
precision recall f1-score support
0 0.5774 0.7742 0.6615 939
1 0.9568 0.8503 0.9004 4796
2 0.4516 0.8909 0.5994 220
accuracy 0.8398 5955
macro avg 0.6620 0.8385 0.7204 5955
weighted avg 0.8783 0.8398 0.8516 5955
scores1= get_metrics(y_train,y_pred_train,y_test,y_pred_test,"DecisionTree_BestParameters",scores1)
scores1
| | Model | Train_Accuracy | Train_Recall | Train_Precision | Train_F1_Score | Test_Accuracy | Test_Recall | Test_Precision | Test_F1_Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DecisionTree_BestParameters | 0.840806 | 0.849623 | 0.666591 | 0.726479 | 0.839798 | 0.838476 | 0.661961 | 0.720439 |
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores = learning_curve(clf_dt_grid, X_train, y_train, cv=skf, scoring='accuracy', n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10))
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.figure(figsize=(8,6))
plt.grid()
plt.title("Learning Curve")
plt.xlabel("Training Examples")
plt.ylabel("Score")
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="y", label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="b", label="Cross-validation score")
plt.legend(loc="best")
plt.show()
#from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, truncnorm, randint
clf3_dt = DecisionTreeClassifier(random_state=123, class_weight='balanced')
max_leaf_nodes = np.random.normal(loc=5, scale=1, size=5).astype(int)
max_leaf_nodes[max_leaf_nodes <1] = 1
print(max_leaf_nodes)
max_depth = np.random.uniform(2,5,4).astype(int)
print(max_depth)
min_samples_split = np.random.uniform(2, 6, 5).astype(int)
print(min_samples_split)
## Set Up Hyperparameter Distributions
# normally distributed max_leaf_nodes, with mean 5 stddev 1
max_leaf_nodes = np.random.normal(loc=5, scale=1, size=5).astype(int)
# uniform distribution from 2 to 5
max_depth = np.random.uniform(2,5,4).astype(int)
# uniform distribution from 2 to 6
min_samples_split = np.random.uniform(2, 6, 5).astype(int)
model_params = {
'max_depth': list(max_depth),
'max_leaf_nodes': list(max_leaf_nodes),
'min_samples_split': list(min_samples_split)
}
clf_random = RandomizedSearchCV(estimator=clf3_dt, param_distributions=model_params, n_iter=600, cv=skf, scoring='recall_macro', n_jobs=-1)  # 'recall_macro' since the target has three classes
clf_random.fit(X_train, y_train)
print(clf_random.best_score_, clf_random.best_params_)
clf_dt_random = clf_random.best_estimator_
print(clf_dt_random)
DecisionTreeClassifier(class_weight='balanced', max_depth=4, max_leaf_nodes=5, min_samples_split=3, random_state=123)
clf_dt_grid2 = DecisionTreeClassifier(class_weight='balanced', max_depth=3, max_leaf_nodes=5,
min_samples_split=3, random_state=123)
clf_dt_grid2.fit(X=X_train, y= y_train)
y_pred_train = clf_dt_grid2.predict(X_train)
y_pred_test = clf_dt_grid2.predict(X_test)
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
Train Accuracy: 0.6828995242093479
Test Accuracy: 0.6896725440806045
*****************************************
CLASSIFICATION REPORT FOR TRAIN DATA
*****************************************
precision recall f1-score support
0 0.3269 0.8882 0.4779 2817
1 0.9755 0.6403 0.7731 14389
2 0.6315 0.7360 0.6797 659
accuracy 0.6829 17865
macro avg 0.6447 0.7548 0.6436 17865
weighted avg 0.8606 0.6829 0.7231 17865
*****************************************
CLASSIFICATION REPORT FOR TEST DATA
*****************************************
precision recall f1-score support
0 0.3307 0.8829 0.4811 939
1 0.9704 0.6499 0.7785 4796
2 0.6822 0.7318 0.7061 220
accuracy 0.6897 5955
macro avg 0.6611 0.7549 0.6552 5955
weighted avg 0.8589 0.6897 0.7289 5955
scores1= get_metrics(y_train,y_pred_train,y_test,y_pred_test,"DecisionTree_BestParameters-2",scores1)
scores1
| | Model | Train_Accuracy | Train_Recall | Train_Precision | Train_F1_Score | Test_Accuracy | Test_Recall | Test_Precision | Test_F1_Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DecisionTree_BestParameters | 0.840806 | 0.849623 | 0.666591 | 0.726479 | 0.839798 | 0.838476 | 0.661961 | 0.720439 |
| 1 | DecisionTree_BestParameters-2 | 0.682900 | 0.754808 | 0.644660 | 0.643605 | 0.689673 | 0.754863 | 0.661100 | 0.655250 |
It is not a generalized model like our previous one, so we will not use it.
2) Random forest with hyperparameter tuning
param_grid = {#"n_estimators" : [50,60],
"criterion": ["gini", "entropy"],
"min_samples_split": [2, 5],
"max_depth": [None, 7],
"min_samples_leaf": [1, 5],
"ccp_alpha":[0.0001,0.001,0.01,0.1]
}
rfclf2 = RandomForestClassifier()
rfclf_grid = RandomizedSearchCV(rfclf2, param_grid, cv=skf)
rfclf_grid.fit(X_train, y_train)
print(rfclf_grid.best_params_)
rfclf_grid = rfclf_grid.best_estimator_
print(rfclf_grid)
RandomForestClassifier(ccp_alpha=0.0001, min_samples_split=5)
rfclf_grid =RandomForestClassifier(ccp_alpha=0.0001, min_samples_split=5)
rfclf_grid.fit(X=X_train, y= y_train)
y_pred_train = rfclf_grid.predict(X_train)
y_pred_test = rfclf_grid.predict(X_test)
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
Train Accuracy: 0.9399944024629163
Test Accuracy: 0.9106633081444164
*****************************************
CLASSIFICATION REPORT FOR TRAIN DATA
*****************************************
precision recall f1-score support
0 0.9432 0.6954 0.8006 2817
1 0.9382 0.9931 0.9649 14389
2 0.9784 0.8255 0.8955 659
accuracy 0.9400 17865
macro avg 0.9533 0.8380 0.8870 17865
weighted avg 0.9404 0.9400 0.9364 17865
*****************************************
CLASSIFICATION REPORT FOR TEST DATA
*****************************************
precision recall f1-score support
0 0.8628 0.5761 0.6909 939
1 0.9161 0.9856 0.9496 4796
2 0.9226 0.7045 0.7990 220
accuracy 0.9107 5955
macro avg 0.9005 0.7554 0.8132 5955
weighted avg 0.9079 0.9107 0.9032 5955
scores1= get_metrics(y_train,y_pred_train,y_test,y_pred_test,"RandomForestClassifier",scores1)
scores1
| | Model | Train_Accuracy | Train_Recall | Train_Precision | Train_F1_Score | Test_Accuracy | Test_Recall | Test_Precision | Test_F1_Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DecisionTree_BestParameters | 0.840806 | 0.849623 | 0.666591 | 0.726479 | 0.839798 | 0.838476 | 0.661961 | 0.720439 |
| 1 | DecisionTree_BestParameters-2 | 0.682900 | 0.754808 | 0.644660 | 0.643605 | 0.689673 | 0.754863 | 0.661100 | 0.655250 |
| 2 | RandomForestClassifier | 0.939994 | 0.838011 | 0.953254 | 0.886967 | 0.910663 | 0.755434 | 0.900514 | 0.813160 |
It is clearly an overfitting model.
Let's build a random forest using variable importance.
importances = rfclf_grid.feature_importances_
importances
indices = np.argsort(importances)[::-1]
print(indices)
select = indices[0:30]
print(select)
rfclf4 = RandomForestClassifier(criterion= 'entropy',
max_depth= 3,
min_samples_leaf= 1,
min_samples_split= 2,
ccp_alpha=0.001)
rfclf4 = rfclf4.fit(X_train.values[:,select], y_train)
y_pred_train = rfclf4.predict(X_train.values[:,select])
y_pred_test = rfclf4.predict(X_test.values[:,select])
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
[ 0 11 14 9 8 12 10 13 4 7 21 17 6 29 3 15 2 1 16 31 5 22 30 37
35 44 53 87 45 38 70 43 73 56 18 52 32 20 19 26 27 24 36 39 69 41 71 25
34 90 72 86 88 59 60 92 62 74 54 91 93 58 55 79 89 96 57 61 94 33 75 95
76 66 77 49 78 83 28 50 51 80 82 84 85 68 47 81 23 67 40 46 64 42 48 65
63]
[ 0 11 14 9 8 12 10 13 4 7 21 17 6 29 3 15 2 1 16 31 5 22 30 37
35 44 53 87 45 38]
Train Accuracy: 0.8784214945424014
Test Accuracy: 0.8779177162048698
*****************************************
CLASSIFICATION REPORT FOR TRAIN DATA
*****************************************
precision recall f1-score support
0 0.9466 0.3269 0.4860 2817
1 0.8743 0.9998 0.9329 14389
2 0.8813 0.5857 0.7037 659
accuracy 0.8784 17865
macro avg 0.9007 0.6375 0.7075 17865
weighted avg 0.8860 0.8784 0.8539 17865
*****************************************
CLASSIFICATION REPORT FOR TEST DATA
*****************************************
precision recall f1-score support
0 0.9560 0.3237 0.4837 939
1 0.8730 1.0000 0.9322 4796
2 0.8951 0.5818 0.7052 220
accuracy 0.8779 5955
macro avg 0.9080 0.6352 0.7070 5955
weighted avg 0.8869 0.8779 0.8531 5955
It is an overfitting model.
clf_XGB_grid = XGBClassifier()  # uncommented: the estimator is needed by the search below
# Use a grid over parameters of interest
param_grid = {
'colsample_bytree': np.linspace(0.6, 0.8, 2),
'n_estimators':[100, 200],
'max_depth': [3, 4],
'gamma': [0.2,0.3,0.4],
'learning_rate': [0.001, 0.01, 0.1, 1, 10],
'scale_pos_weight':[4]
}
CV_XGB = RandomizedSearchCV(estimator=clf_XGB_grid, param_distributions=param_grid, n_jobs=-1, cv=skf, scoring='recall_macro')  # 'recall_macro' since the target has three classes
%time CV_XGB.fit(X = X_train, y=y_train)
best_xgb_model = CV_XGB.best_estimator_
print (CV_XGB.best_score_, CV_XGB.best_params_)
CV_XGB = CV_XGB.best_estimator_
print(CV_XGB)
{'scale_pos_weight': 4, 'n_estimators': 200, 'max_depth': 4, 'learning_rate': 0.1, 'gamma': 0.4, 'colsample_bytree': 0.6}
CV_XGB =XGBClassifier(scale_pos_weight=4,n_estimators=200,max_depth=4,learning_rate=0.1,gamma=0.4,colsample_bytree=0.6)
CV_XGB.fit(X=X_train, y= y_train)
y_pred_train = CV_XGB.predict(X_train)
y_pred_test = CV_XGB.predict(X_test)
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
[05:22:17] WARNING: ../src/learner.cc:767:
Parameters: { "scale_pos_weight" } are not used.
Train Accuracy: 0.9280156731038343
Test Accuracy: 0.9163727959697733
*****************************************
CLASSIFICATION REPORT FOR TRAIN DATA
*****************************************
precision recall f1-score support
0 0.8856 0.6486 0.7488 2817
1 0.9313 0.9842 0.9570 14389
2 0.9916 0.8953 0.9410 659
accuracy 0.9280 17865
macro avg 0.9362 0.8427 0.8823 17865
weighted avg 0.9263 0.9280 0.9236 17865
*****************************************
CLASSIFICATION REPORT FOR TEST DATA
*****************************************
precision recall f1-score support
0 0.8509 0.6198 0.7172 939
1 0.9237 0.9796 0.9508 4796
2 0.9568 0.8045 0.8741 220
accuracy 0.9164 5955
macro avg 0.9104 0.8013 0.8474 5955
weighted avg 0.9134 0.9164 0.9111 5955
scores1= get_metrics(y_train,y_pred_train,y_test,y_pred_test,"XGB_BestParameters",scores1)
scores1
| | Model | Train_Accuracy | Train_Recall | Train_Precision | Train_F1_Score | Test_Accuracy | Test_Recall | Test_Precision | Test_F1_Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DecisionTree_BestParameters | 0.840806 | 0.849623 | 0.666591 | 0.726479 | 0.839798 | 0.838476 | 0.661961 | 0.720439 |
| 1 | DecisionTree_BestParameters-2 | 0.682900 | 0.754808 | 0.644660 | 0.643605 | 0.689673 | 0.754863 | 0.661100 | 0.655250 |
| 2 | RandomForestClassifier | 0.939994 | 0.838011 | 0.953254 | 0.886967 | 0.910663 | 0.755434 | 0.900514 | 0.813160 |
| 3 | XGB_BestParameters | 0.928016 | 0.842694 | 0.936161 | 0.882260 | 0.916373 | 0.801307 | 0.910449 | 0.847361 |
#Gradient boosting with hyperparameter tuning
# define the hyperparameter space
param_dist = {'learning_rate': [0.1, 0.05, 0.01],
'max_depth': [2,4,6],
'max_features': ['auto', 'sqrt', 'log2'],}
# define the classifier
clf = GradientBoostingClassifier()
# define the random search with cross validation
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist, n_iter=n_iter_search, cv=skf, scoring='recall_macro')  # 'recall_macro' since the target has three classes
# fit the random search to the data
random_search.fit(X_train, y_train)
# print the best hyperparameters and score
print("Best hyperparameters: ", random_search.best_params_)
print("Best score: ", random_search.best_score_)
Best hyperparameters: {'max_features': 'sqrt', 'max_depth': 4, 'learning_rate': 0.01}
clf_GBM = GradientBoostingClassifier(max_features='sqrt',max_depth=4,learning_rate=0.01)
clf_GBM.fit(X=X_train, y=y_train)
y_pred_train = clf_GBM.predict(X_train)
y_pred_test = clf_GBM.predict(X_test)
print("Train Accuracy:", accuracy_score(y_train,y_pred_train))
# print("Train Classification Report:")
# print(classification_report(y_train,y_pred_train,digits=4))
print("Test Accuracy:",accuracy_score(y_test,y_pred_test))
# print("Test Classification Report:")
#print(classification_report(y_test,y_pred_test,digits=4))
classifcation_report_train_test(y_train,y_pred_train,y_test,y_pred_test)
scores = get_metrics(y_train,y_pred_train,y_test,y_pred_test,"GBM",scores)
Train Accuracy: 0.8774139378673383
Test Accuracy: 0.8770780856423174
*****************************************
CLASSIFICATION REPORT FOR TRAIN DATA
*****************************************
precision recall f1-score support
0 0.9361 0.3326 0.4908 2817
1 0.8723 1.0000 0.9318 14389
2 0.9458 0.5296 0.6790 659
accuracy 0.8774 17865
macro avg 0.9181 0.6207 0.7005 17865
weighted avg 0.8851 0.8774 0.8529 17865
*****************************************
CLASSIFICATION REPORT FOR TEST DATA
*****************************************
precision recall f1-score support
0 0.9426 0.3323 0.4913 939
1 0.8715 1.0000 0.9314 4796
2 0.9504 0.5227 0.6745 220
accuracy 0.8771 5955
macro avg 0.9215 0.6183 0.6991 5955
weighted avg 0.8856 0.8771 0.8525 5955
It is not an overfitting model but a generalized one; however, the recall for classes 0 and 2 is very low. The model is good at predicting label 1 but not labels 0 and 2.
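One possible way to lift recall on the rare classes: `GradientBoostingClassifier` has no `class_weight` parameter, but its `fit` method accepts `sample_weight`, so "balanced" re-weighting can be passed per sample. A sketch on synthetic data (the dataset and parameters below are illustrative assumptions, not our actual pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced 3-class problem standing in for our target.
X_demo, y_demo = make_classification(n_samples=2000, n_classes=3,
                                     n_informative=6,
                                     weights=[0.15, 0.80, 0.05],
                                     random_state=42)

# "balanced" per-sample weights: rare classes get proportionally more weight.
w = compute_sample_weight(class_weight="balanced", y=y_demo)

gbm_w = GradientBoostingClassifier(max_depth=4, learning_rate=0.1,
                                   random_state=42)
gbm_w.fit(X_demo, y_demo, sample_weight=w)
macro_recall = recall_score(y_demo, gbm_w.predict(X_demo), average="macro")
```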
scores1= get_metrics(y_train,y_pred_train,y_test,y_pred_test,"GradientBoostingClassifier",scores1)
scores1
| | Model | Train_Accuracy | Train_Recall | Train_Precision | Train_F1_Score | Test_Accuracy | Test_Recall | Test_Precision | Test_F1_Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DecisionTree_BestParameters | 0.840806 | 0.849623 | 0.666591 | 0.726479 | 0.839798 | 0.838476 | 0.661961 | 0.720439 |
| 1 | DecisionTree_BestParameters-2 | 0.682900 | 0.754808 | 0.644660 | 0.643605 | 0.689673 | 0.754863 | 0.661100 | 0.655250 |
| 2 | RandomForestClassifier | 0.939994 | 0.838011 | 0.953254 | 0.886967 | 0.910663 | 0.755434 | 0.900514 | 0.813160 |
| 3 | XGB_BestParameters | 0.928016 | 0.842694 | 0.936161 | 0.882260 | 0.916373 | 0.801307 | 0.910449 | 0.847361 |
| 4 | GradientBoostingClassifier | 0.877414 | 0.620738 | 0.918063 | 0.700544 | 0.877078 | 0.618332 | 0.921512 | 0.699059 |
Out of all the models with hyperparameter tuning, the best is the decision tree with {'min_samples_leaf': 5, 'max_leaf_nodes': 20, 'max_depth': 6, 'criterion': 'gini', 'class_weight': 'balanced'}.
from sklearn.tree import export_text
rules = export_text(clf_dt_grid, feature_names=X_train.columns.tolist())
# Sort rules by importance scores and select the top 20 rules
rule_scores = clf_dt_grid.feature_importances_
rule_list = rules.split('\n')
rule_list.pop() # Remove the last empty element
rule_importance = [(rule, score) for rule, score in zip(rule_list, rule_scores)]
rule_importance_sorted = sorted(rule_importance, key=lambda x: x[1], reverse=True)
top_20_rules = [rule for rule, score in rule_importance_sorted[:20]]
# Print the top 20 rules
print("Top 20 rules:")
for rule in top_20_rules:
print(rule)
Top 20 rules: |--- sumoforderquantity <= -0.11 | | | | |--- recency <= 0.24 | | | | | | |--- class: 1 | | | | | | |--- num_transactions <= 0.01 | | | | | | |--- class: 0 | | | |--- duration_of_customer_relation <= -0.48 | | | | | |--- response_rate > 3.29 | | |--- AutomaticRefill_yes <= 0.50 | | | | |--- recency > 0.24 | | | | | | |--- numofemails <= -0.85 | | | | | | | |--- open_rate_ratio <= 0.55 | | | | | |--- class: 0 | | | | |--- Registration_month_12 > 0.50 | | | | | |--- days_since_first_transaction > -0.35 | | | | |--- class: 1 | | | | | |--- days_since_first_transaction <= -0.35 | | | | | | | |--- open_rate_ratio > 0.55 | |--- numofemails <= -0.29 | | | | | |--- OnlineCommunication_yes <= 0.50 | | | | | |--- OnlineCommunication_yes > 0.50
customer=pd.read_excel("/content/Customerdata_Test.xlsx")
test=pd.read_excel("/content/Test.xlsx")
transaction=pd.read_excel("/content/transactionsdata_Test.xlsx")
email=pd.read_excel("/content/Emaildata_Test.xlsx")
#checking all the datasets shape to get a proper understanding
print("shape of customer",customer.shape)
print("shape of test",test.shape)
print("shape of transaction",transaction.shape)
print("shape of email",email.shape)
shape of customer (5955, 7)
shape of test (5955, 1)
shape of transaction (26768, 4)
shape of email (129074, 6)
#customer.head()
customer["CustomerID"].nunique()
#checking unique values for all the customer ids
5955
#test.head()
test["CustomerID"].nunique()
#checking unique values for all the customer ids
5955
#transaction['DateOfOrder'] = pd.to_datetime(transaction['DateOfOrder'])
transaction["CustomerID"].nunique()
#checking unique values for all the customer ids
5955
# combine the 'Dateoforder' and 'Timestamp' columns into a single datetime column
transaction['last_transaction_date'] = pd.to_datetime(transaction['DateOfOrder'] +' '+transaction['timestamp'])
transaction['DateOfOrder']=pd.to_datetime(transaction['DateOfOrder'])
# Sort the dataset by customer ID and purchase date
df3 = transaction.sort_values(['CustomerID', 'DateOfOrder'])
# Calculate the time between purchases for each customer
df3['time_between_purchases'] = df3.groupby('CustomerID')['DateOfOrder'].diff().dt.days.fillna(0)
df3['order_month']=pd.to_datetime(df3['DateOfOrder']).dt.month
# group the data by customer ID and month, and sum the order quantity in each group
order_counts = df3.groupby(['CustomerID', 'order_month'])['OrderQuantity'].sum()
# use the unstack method to pivot the table so that the rows represent customers and the columns represent months
order_counts = order_counts.unstack(level=-1, fill_value=0)
# fill any missing values with 0
order_counts = order_counts.fillna(0)
order_counts.columns=["Totalorders-jan","Totalorders-feb","Totalorders-march","Totalorders-april","Totalorders-may","Totalorders-june","Totalorders-july","Totalorders-august","Totalorders-september","Totalorders-october","Totalorders-november","Totalorders-december"]
order_counts=order_counts.reset_index(drop=False)
order_counts
| CustomerID | Totalorders-jan | Totalorders-feb | Totalorders-march | Totalorders-april | Totalorders-may | Totalorders-june | Totalorders-july | Totalorders-august | Totalorders-september | Totalorders-october | Totalorders-november | Totalorders-december | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C12121 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 52 |
| 1 | C12132 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40 | 0 | 0 |
| 2 | C12134 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40 | 0 | 0 | 0 |
| 3 | C12135 | 41 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | C12136 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 79 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5950 | C42153 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 82 | 297 | 0 |
| 5951 | C42155 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 41 | 0 | 0 | 0 |
| 5952 | C42157 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 41 | 0 | 0 | 0 |
| 5953 | C42167 | 0 | 0 | 90 | 74 | 0 | 0 | 0 | 0 | 0 | 187 | 151 | 0 |
| 5954 | C42190 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40 | 0 | 0 | 0 |
5955 rows × 13 columns
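The 12-name rename above assumes every month 1 through 12 actually appears somewhere in the data; if a month were absent, `unstack` would produce fewer than 12 columns and the rename would fail. A minimal sketch on toy data (not the project files) shows one way to guard this by reindexing to the full month range first:

```python
import pandas as pd

# Sketch: reindex the pivoted columns to 1..12 so the 12 hard-coded names are
# always safe, even when some months never occur. Toy data, not the real files.
df = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "order_month": [1, 12, 3],
    "OrderQuantity": [5, 7, 2],
})
counts = (
    df.groupby(["CustomerID", "order_month"])["OrderQuantity"].sum()
      .unstack(fill_value=0)
      .reindex(columns=range(1, 13), fill_value=0)  # guarantee 12 month columns
)
counts.columns = [f"Totalorders-{m}" for m in counts.columns]
counts = counts.reset_index()
```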
# Group the data by customer ID and calculate the number of transactions, total amount spent, and last transaction date
grouped_data = df3.groupby('CustomerID').agg({
'CustomerID': 'count', # Number of transactions
'OrderQuantity': ['sum','min','max'],# Total orders
'last_transaction_date': ['min', 'max'],# first transaction date,Last transaction date
'time_between_purchases':'mean'#average time between purchases
})
# Rename the columns to be more descriptive
grouped_data.columns = ['num_transactions','totalorders','minimumorderquantity','maximumorderquantity','first_transaction_date','last_transaction_date','average_time_between_purchase']
grouped_data['DateOfOrder'] = grouped_data['last_transaction_date'].dt.date.astype(str)
grouped_data['timestamp'] = grouped_data['last_transaction_date'].dt.time.apply(lambda x: x.strftime('%H:%M:%S'))
grouped_data["averageorders"]=grouped_data["totalorders"]//grouped_data["num_transactions"]#average orders per transaction (floor division)
grouped_data = grouped_data.reset_index(drop=False)
# merge the grouped DataFrame back with the main DataFrame
transaction_data = pd.merge(df3, grouped_data, on=['CustomerID','last_transaction_date'],how="inner")
transaction_data = pd.merge(transaction_data, order_counts, on=['CustomerID'],how="inner")
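The `agg` dict above produces a MultiIndex that then needs a manual column rename. As a hedged alternative (assuming pandas >= 0.25), named aggregation gives flat, descriptive names up front; a minimal sketch on toy data:

```python
import pandas as pd

# Sketch: named aggregation avoids the MultiIndex + manual-rename step used
# above. Toy stand-in data, not the project transaction file.
tx = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "OrderQuantity": [10, 20, 5],
    "last_transaction_date": pd.to_datetime(
        ["2017-01-01", "2017-06-01", "2017-03-01"]),
})
grouped = tx.groupby("CustomerID").agg(
    num_transactions=("OrderQuantity", "count"),        # rows per customer
    totalorders=("OrderQuantity", "sum"),               # total quantity
    first_transaction_date=("last_transaction_date", "min"),
    last_transaction_date=("last_transaction_date", "max"),
).reset_index()
```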
transaction['last_transaction_date']
transaction = transaction.rename(columns={'last_transaction_date': 'transaction_date'})
transaction_data.shape
#rechecking for incorrect dtypes
#transaction_data["CustomerID"].nunique()
#rechecking for unique customerids
(5955, 28)
transaction_data.columns
Index(['CustomerID', 'DateOfOrder_x', 'timestamp_x', 'OrderQuantity',
'last_transaction_date', 'time_between_purchases', 'order_month',
'num_transactions', 'totalorders', 'minimumorderquantity',
'maximumorderquantity', 'first_transaction_date',
'average_time_between_purchase', 'DateOfOrder_y', 'timestamp_y',
'averageorders', 'Totalorders-jan', 'Totalorders-feb',
'Totalorders-march', 'Totalorders-april', 'Totalorders-may',
'Totalorders-june', 'Totalorders-july', 'Totalorders-august',
'Totalorders-september', 'Totalorders-october', 'Totalorders-november',
'Totalorders-december'],
dtype='object')
transaction_data=transaction_data.drop(['DateOfOrder_x', 'timestamp_x','DateOfOrder_y', 'timestamp_y','time_between_purchases','order_month'],axis=1)
transaction_data.head()
| CustomerID | OrderQuantity | last_transaction_date | num_transactions | totalorders | minimumorderquantity | maximumorderquantity | first_transaction_date | average_time_between_purchase | averageorders | Totalorders-jan | Totalorders-feb | Totalorders-march | Totalorders-april | Totalorders-may | Totalorders-june | Totalorders-july | Totalorders-august | Totalorders-september | Totalorders-october | Totalorders-november | Totalorders-december | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C12121 | 52 | 2017-12-14 11:38:00 | 1 | 52 | 52 | 52 | 2017-12-14 11:38:00 | 0.0 | 52 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 52 |
| 1 | C12132 | 40 | 2013-10-20 17:52:44 | 1 | 40 | 40 | 40 | 2013-10-20 17:52:44 | 0.0 | 40 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40 | 0 | 0 |
| 2 | C12134 | 40 | 2013-09-18 10:24:48 | 1 | 40 | 40 | 40 | 2013-09-18 10:24:48 | 0.0 | 40 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40 | 0 | 0 | 0 |
| 3 | C12135 | 41 | 2018-01-10 19:33:58 | 1 | 41 | 41 | 41 | 2018-01-10 19:33:58 | 0.0 | 41 | 41 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | C12136 | 79 | 2013-10-19 20:51:18 | 1 | 79 | 79 | 79 | 2013-10-19 20:51:18 | 0.0 | 79 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 79 | 0 | 0 |
email.head()
| CustomerID | DateOfemail | EmailType | timestamp | MailOpened | MailClicked | |
|---|---|---|---|---|---|---|
| 0 | C17838 | 2010-10-22 | WelcomeEmail | 16:59:02 | yes | NotApplicable |
| 1 | C12403 | 2008-11-17 | WelcomeEmail | 12:46:56 | yes | NotApplicable |
| 2 | C19041 | 2011-01-24 | WelcomeEmail | 14:35:22 | yes | NotApplicable |
| 3 | C24069 | 2010-09-18 | WelcomeEmail | 12:25:26 | yes | NotApplicable |
| 4 | C26718 | 2010-10-02 | WelcomeEmail | 11:55:58 | yes | NotApplicable |
email.columns
Index(['CustomerID', 'DateOfemail', 'EmailType', 'timestamp', 'MailOpened',
'MailClicked'],
dtype='object')
# combine the 'Date of email' and 'Time' columns into a single datetime column
email['emailDatetime'] = pd.to_datetime(email['DateOfemail'] +' '+email['timestamp'])
email["EmailType"].value_counts()
CustomizedEmail    123119
WelcomeEmail         5955
Name: EmailType, dtype: int64
email.columns
Index(['CustomerID', 'DateOfemail', 'EmailType', 'timestamp', 'MailOpened',
'MailClicked', 'emailDatetime'],
dtype='object')
# group the data by customer ID and MailOpened status, and count how many mails each customer opened or left unopened
mail_open_counts = email.groupby(['CustomerID', 'MailOpened'])['emailDatetime'].count()
# use the unstack method to pivot the table so that the rows represent customers and the columns represent the opened/not-opened status
mail_open_counts = mail_open_counts.unstack(level=-1, fill_value=0)
# fill any missing values with 0
mail_open_counts = mail_open_counts.fillna(0)
mail_open_counts.columns=["Totalmails-notoponed","Totalmails-oponed"]
mail_open_counts=mail_open_counts.reset_index(drop=False)
# group the data by customer ID and MailClicked status, and count how many mails each customer clicked, did not click, or where a click was not applicable
MailClicked_counts = email.groupby(['CustomerID', 'MailClicked'])['emailDatetime'].count()
# use the unstack method to pivot the table so that the rows represent customers and the columns represent the click status
MailClicked_counts = MailClicked_counts.unstack(level=-1, fill_value=0)
# fill any missing values with 0
MailClicked_counts = MailClicked_counts.fillna(0)
# note: unstack orders the new columns by sorted value ('NotApplicable' < 'no' < 'yes' in ASCII order), so the positional rename below should be verified against MailClicked_counts.columns
MailClicked_counts.columns=["Notapplicableclickedmails","timesclicked-yes","timesclicked-no"]
MailClicked_counts=MailClicked_counts.reset_index(drop=False)
MailClicked_counts
| CustomerID | Notapplicableclickedmails | timesclicked-yes | timesclicked-no | |
|---|---|---|---|---|
| 0 | C12121 | 1 | 21 | 3 |
| 1 | C12132 | 1 | 0 | 0 |
| 2 | C12134 | 1 | 0 | 0 |
| 3 | C12135 | 1 | 24 | 5 |
| 4 | C12136 | 1 | 40 | 1 |
| ... | ... | ... | ... | ... |
| 5950 | C42153 | 1 | 9 | 1 |
| 5951 | C42155 | 1 | 0 | 0 |
| 5952 | C42157 | 1 | 0 | 0 |
| 5953 | C42167 | 1 | 48 | 1 |
| 5954 | C42190 | 1 | 0 | 0 |
5955 rows × 4 columns
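Because `unstack` sorts the pivoted columns by value rather than by any order we choose, renaming by value instead of by position removes the risk of mislabeling. A minimal sketch on toy data (not the project email file):

```python
import pandas as pd

# Sketch: 'NotApplicable' sorts before 'no' and 'yes' (uppercase precedes
# lowercase in ASCII), so a value-based rename is safer than a positional list.
email = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C1", "C2"],
    "MailClicked": ["NotApplicable", "yes", "no", "yes"],
})
clicks = (
    email.groupby(["CustomerID", "MailClicked"]).size()
         .unstack(fill_value=0)
)
# rename by value, not by position
clicks = clicks.rename(columns={
    "NotApplicable": "Notapplicableclickedmails",
    "yes": "timesclicked-yes",
    "no": "timesclicked-no",
}).reset_index()
```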
#filtering customers who have received welcome mail but not a customized mail
welcome_mail = email[(email['EmailType'] == 'WelcomeEmail') & ~(email['CustomerID'].isin(email[email['EmailType'] == 'CustomizedEmail']['CustomerID'].unique()))].copy()#copy to avoid SettingWithCopyWarning on later assignments
welcome_mail["CustomerID"].nunique()
2457
welcome_mail['DateOfemail']=pd.to_datetime(welcome_mail['DateOfemail'])
# Sort the dataset by customer ID and email date
welcome_mail = welcome_mail.sort_values(['CustomerID', 'DateOfemail'])
# Calculate the time between emails for each customer
welcome_mail['time_between_emails'] = welcome_mail.groupby('CustomerID')['DateOfemail'].diff().dt.days.fillna(0)
# Group the data by customer ID and calculate the number of emails sent to customer,first email sent, and last email date
grouped_data3 = welcome_mail.groupby('CustomerID').agg({
'CustomerID': 'count', # Number of emails
'emailDatetime': ['max','min'], # Last email date,first email date
'time_between_emails':'mean'
})
# Rename the columns to be more descriptive
grouped_data3.columns = ['numofemails','lastemailDatetime','firstemailDatetime','average_time_between_mails']
grouped_data3 = grouped_data3.reset_index(drop=False)
grouped_data3['DateOfemail'] = grouped_data3['lastemailDatetime'].dt.date.astype(str)
grouped_data3['timestamp'] = grouped_data3['lastemailDatetime'].dt.time.apply(lambda x: x.strftime('%H:%M:%S'))
welcome_mail['DateOfemail']=welcome_mail['DateOfemail'].dt.date.astype(str)
# merge the grouped DataFrame back with the main DataFrame
merged_data3 = pd.merge(welcome_mail, grouped_data3, on=['CustomerID','DateOfemail','timestamp'],how="inner")
#merged_data.shape
#merged_data["CustomerID"].nunique()
#merged_data.columns
merged_data3=merged_data3.drop(['DateOfemail','timestamp'],axis=1)
#filtering persons who received customized mail
customizedmail=email[email['EmailType'] == 'CustomizedEmail'].copy()#copy to avoid SettingWithCopyWarning on later assignments
customizedmail['DateOfemail']=pd.to_datetime(customizedmail['DateOfemail'])
# Sort the dataset by customer ID and email date
customizedmail = customizedmail.sort_values(['CustomerID', 'DateOfemail'])
# Calculate the time between emails for each customer
customizedmail['time_between_emails'] = customizedmail.groupby('CustomerID')['DateOfemail'].diff().dt.days.fillna(0)
# Group the data by customer ID and calculate the number of emails sent to customer,first email sent, and last email date
grouped_data2 = customizedmail.groupby('CustomerID').agg({
'CustomerID': 'count', # Number of emails
'emailDatetime': ['max','min'], # Last email date,first email date
'time_between_emails':'mean'
})
# Rename the columns to be more descriptive
grouped_data2.columns = ['numofemails','lastemailDatetime','firstemailDatetime','average_time_between_mails']
grouped_data2 = grouped_data2.reset_index(drop=False)
grouped_data2['DateOfemail'] = grouped_data2['lastemailDatetime'].dt.date.astype(str)
grouped_data2['timestamp'] = grouped_data2['lastemailDatetime'].dt.time.apply(lambda x: x.strftime('%H:%M:%S'))
customizedmail['DateOfemail']=customizedmail['DateOfemail'].dt.date.astype(str)
# merge the grouped DataFrame back with the main DataFrame
merged_data = pd.merge(customizedmail, grouped_data2, on=['CustomerID','DateOfemail','timestamp'],how="inner")
#merged_data.shape
#merged_data["CustomerID"].nunique()
#merged_data.columns
merged_data=merged_data.drop(['DateOfemail','timestamp'],axis=1)
mail_data= pd.concat([merged_data3,merged_data])
mail_data.shape
(5955, 10)
mail_data=mail_data.drop(["time_between_emails"],axis=1)
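The welcome-mail and customized-mail branches above run the same sort/diff/aggregate pipeline twice. A minimal sketch of factoring it into one helper (the name `summarize_emails` is mine; it assumes `DateOfemail` and `emailDatetime` are already datetimes):

```python
import pandas as pd

# Sketch: one helper for the aggregation both email branches repeat.
def summarize_emails(df):
    df = df.sort_values(["CustomerID", "DateOfemail"]).copy()
    df["time_between_emails"] = (
        df.groupby("CustomerID")["DateOfemail"].diff().dt.days.fillna(0)
    )
    return df.groupby("CustomerID").agg(
        numofemails=("emailDatetime", "count"),
        lastemailDatetime=("emailDatetime", "max"),
        firstemailDatetime=("emailDatetime", "min"),
        average_time_between_mails=("time_between_emails", "mean"),
    ).reset_index()

# toy data standing in for either email subset
toy = pd.DataFrame({
    "CustomerID": ["C1", "C1"],
    "DateOfemail": pd.to_datetime(["2013-01-01", "2013-01-11"]),
    "emailDatetime": pd.to_datetime(["2013-01-01 10:00", "2013-01-11 09:00"]),
})
summary = summarize_emails(toy)
```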
test=customer.merge(test, on="CustomerID").merge(mail_data, on="CustomerID").merge(transaction_data, on="CustomerID").merge(mail_open_counts, on="CustomerID").merge(MailClicked_counts, on="CustomerID")
test.shape
(5955, 41)
#checking the number of unique values in each column so that we can spot any incorrect dtypes
for i in test.columns:
print("no of unique values and dtype:","|",i,"|",test[i].nunique(),"|",test[i].dtypes)
no of unique values and dtype: | CustomerID | 5955 | object
no of unique values and dtype: | City | 4 | object
no of unique values and dtype: | DateOfRegistration | 1631 | datetime64[ns]
no of unique values and dtype: | OnlineCommunication | 2 | int64
no of unique values and dtype: | AutomaticRefill | 2 | int64
no of unique values and dtype: | DoorstepDelivery | 2 | int64
no of unique values and dtype: | PreferredDeliveryDay | 7 | object
no of unique values and dtype: | EmailType | 2 | object
no of unique values and dtype: | MailOpened | 2 | object
no of unique values and dtype: | MailClicked | 3 | object
no of unique values and dtype: | emailDatetime | 5954 | datetime64[ns]
no of unique values and dtype: | numofemails | 75 | int64
no of unique values and dtype: | lastemailDatetime | 5954 | datetime64[ns]
no of unique values and dtype: | firstemailDatetime | 5955 | datetime64[ns]
no of unique values and dtype: | average_time_between_mails | 2097 | float64
no of unique values and dtype: | OrderQuantity | 234 | int64
no of unique values and dtype: | last_transaction_date | 5953 | datetime64[ns]
no of unique values and dtype: | num_transactions | 106 | int64
no of unique values and dtype: | totalorders | 1030 | int64
no of unique values and dtype: | minimumorderquantity | 221 | int64
no of unique values and dtype: | maximumorderquantity | 248 | int64
no of unique values and dtype: | first_transaction_date | 5954 | datetime64[ns]
no of unique values and dtype: | average_time_between_purchase | 1038 | float64
no of unique values and dtype: | averageorders | 234 | int64
no of unique values and dtype: | Totalorders-jan | 381 | int64
no of unique values and dtype: | Totalorders-feb | 298 | int64
no of unique values and dtype: | Totalorders-march | 316 | int64
no of unique values and dtype: | Totalorders-april | 318 | int64
no of unique values and dtype: | Totalorders-may | 342 | int64
no of unique values and dtype: | Totalorders-june | 335 | int64
no of unique values and dtype: | Totalorders-july | 352 | int64
no of unique values and dtype: | Totalorders-august | 364 | int64
no of unique values and dtype: | Totalorders-september | 365 | int64
no of unique values and dtype: | Totalorders-october | 405 | int64
no of unique values and dtype: | Totalorders-november | 407 | int64
no of unique values and dtype: | Totalorders-december | 401 | int64
no of unique values and dtype: | Totalmails-notoponed | 64 | int64
no of unique values and dtype: | Totalmails-oponed | 52 | int64
no of unique values and dtype: | Notapplicableclickedmails | 1 | int64
no of unique values and dtype: | timesclicked-yes | 73 | int64
no of unique values and dtype: | timesclicked-no | 23 | int64
#converting dtypes of the following columns
test["OnlineCommunication"]=test["OnlineCommunication"].astype("object")
test["DoorstepDelivery"]=test["DoorstepDelivery"].astype("object")
test["AutomaticRefill"]=test["AutomaticRefill"].astype("object")
test["days_since_first_transaction"]=(test['last_transaction_date'] - test['first_transaction_date']).dt.days
test["email_duration"]=(test['lastemailDatetime']- test['firstemailDatetime']).dt.days
test["duration_of_customer_relation"]=(test['last_transaction_date'] - test['DateOfRegistration']).dt.days
test["recency"]=(test["last_transaction_date"].max()-test["last_transaction_date"]).dt.days
test["email_duration"]
0 0
1 1859
2 1058
3 50
4 1023
...
5950 0
5951 0
5952 0
5953 0
5954 0
Name: email_duration, Length: 5955, dtype: int64
bins = pd.cut(test["days_since_first_transaction"], [-1, 1, 10, 100, test["days_since_first_transaction"].max()], labels=["Only transaction", "Recent Transactions", "Moderate History", "Long-term History"])
test["days_since_first_transaction_category"] = bins
test["days_since_first_transaction_category"].value_counts()
Only transaction       3748
Moderate History       1021
Long-term History       987
Recent Transactions     199
Name: days_since_first_transaction_category, dtype: int64
bins = pd.cut(test["email_duration"], [-1, 1, 10, 100, test["email_duration"].max()], labels=["Only email", "Medium history", "long mail history", "very Long mail History"])
test["email_duration_category"] = bins
test["email_duration_category"].value_counts()
Only email                2790
long mail history         1496
very Long mail History    1176
Medium history             493
Name: email_duration_category, dtype: int64
test["email_duration_category"]=test["email_duration_category"].astype("object")
test["days_since_first_transaction_category"]=test["days_since_first_transaction_category"].astype("object")
#these columns have already been binned above, so they could be dropped
#test=test.drop(["days_since_first_transaction","email_duration"],axis=1)
test.columns
Index(['CustomerID', 'City', 'DateOfRegistration', 'OnlineCommunication',
'AutomaticRefill', 'DoorstepDelivery', 'PreferredDeliveryDay',
'EmailType', 'MailOpened', 'MailClicked', 'emailDatetime',
'numofemails', 'lastemailDatetime', 'firstemailDatetime',
'average_time_between_mails', 'OrderQuantity', 'last_transaction_date',
'num_transactions', 'totalorders', 'minimumorderquantity',
'maximumorderquantity', 'first_transaction_date',
'average_time_between_purchase', 'averageorders', 'Totalorders-jan',
'Totalorders-feb', 'Totalorders-march', 'Totalorders-april',
'Totalorders-may', 'Totalorders-june', 'Totalorders-july',
'Totalorders-august', 'Totalorders-september', 'Totalorders-october',
'Totalorders-november', 'Totalorders-december', 'Totalmails-notoponed',
'Totalmails-oponed', 'Notapplicableclickedmails', 'timesclicked-yes',
'timesclicked-no', 'days_since_first_transaction', 'email_duration',
'duration_of_customer_relation', 'recency',
'days_since_first_transaction_category', 'email_duration_category'],
dtype='object')
test["Totalorderquantity_in_first_quarter"]=test["Totalorders-jan"]+test["Totalorders-feb"]+test["Totalorders-march"]+test["Totalorders-april"]
test["Totalorderquantity_in_second_quarter"]=test["Totalorders-may"]+test["Totalorders-june"]+test["Totalorders-july"]+test["Totalorders-august"]
test["Totalorderquantity_in_Third_quarter"]=test["Totalorders-december"]+test["Totalorders-september"]+test["Totalorders-october"]+test["Totalorders-november"]
test=test.drop(['Totalorders-jan', 'Totalorders-feb',
'Totalorders-march', 'Totalorders-april', 'Totalorders-may',
'Totalorders-june', 'Totalorders-july', 'Totalorders-august',
'Totalorders-september', 'Totalorders-october', 'Totalorders-november',
'Totalorders-december'],axis=1)
test["OnlineCommunication"].value_counts()
# map OnlineCommunication to new categories
new_categories = {
0: 'no',
1: 'yes',
}
test['OnlineCommunication'] = test['OnlineCommunication'].map(new_categories)
test['OnlineCommunication']=test['OnlineCommunication'].astype("category")
# print the resulting dataframe
print(test['OnlineCommunication'].value_counts())
yes    3926
no     2029
Name: OnlineCommunication, dtype: int64
#test["AutomaticRefill"].value_counts()
# map AutomaticRefill to new categories
new_categories = {
0: 'no',
1: 'yes',
}
test['AutomaticRefill'] = test['AutomaticRefill'].map(new_categories)
# print the resulting dataframe
print(test['AutomaticRefill'].value_counts())
test['AutomaticRefill']=test['AutomaticRefill'].astype("category")
no     5389
yes     566
Name: AutomaticRefill, dtype: int64
#test["DoorstepDelivery"].value_counts()
# map DoorstepDelivery to new categories
new_categories = {
0: 'no',
1: 'yes',
}
test['DoorstepDelivery'] = test['DoorstepDelivery'].map(new_categories)
test['DoorstepDelivery']=test['DoorstepDelivery'].astype("category")
# print the resulting dataframe
print(test['DoorstepDelivery'].value_counts())
no     5723
yes     232
Name: DoorstepDelivery, dtype: int64
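The three 0/1-to-no/yes conversions above share an identical mapping dictionary; a single loop can apply it to all three columns. A minimal sketch on a toy frame standing in for the test dataframe:

```python
import pandas as pd

# Sketch: one loop instead of three near-identical mapping cells. Toy data.
df = pd.DataFrame({
    "OnlineCommunication": [0, 1, 1],
    "AutomaticRefill": [1, 0, 0],
    "DoorstepDelivery": [0, 0, 1],
})
for col in ["OnlineCommunication", "AutomaticRefill", "DoorstepDelivery"]:
    df[col] = df[col].map({0: "no", 1: "yes"}).astype("category")
```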
test["PreferredDeliveryDay"].value_counts()
# map PreferredDeliveryDay to new categories
new_categories = {
'Monday': 'Monday',
'Tuesday':'Tuesday',
'Wednesday':'Wednesday',
'Thursday':'Thursday',
'Friday':'Friday',
'Saturday':'weekend',
'Sunday':'weekend'
}
test['PreferredDeliveryDay'] = test['PreferredDeliveryDay'].map(new_categories)
test['PreferredDeliveryDay']=test['PreferredDeliveryDay'].astype("category")
# print the resulting dataframe
print(test['PreferredDeliveryDay'].value_counts())
Monday       1350
Tuesday      1311
Friday       1052
Thursday     1008
Wednesday     968
weekend       266
Name: PreferredDeliveryDay, dtype: int64
test["Notapplicableclickedmails"].value_counts()#this column holds a single value for every customer, so it carries no information and can be dropped
test=test.drop(["Notapplicableclickedmails"],axis=1)
categoricalcolumns=test.select_dtypes(include='object')
test.dtypes
CustomerID                                       object
City                                             object
DateOfRegistration                       datetime64[ns]
OnlineCommunication                            category
AutomaticRefill                                category
DoorstepDelivery                               category
PreferredDeliveryDay                           category
EmailType                                        object
MailOpened                                       object
MailClicked                                      object
emailDatetime                            datetime64[ns]
numofemails                                       int64
lastemailDatetime                        datetime64[ns]
firstemailDatetime                       datetime64[ns]
average_time_between_mails                      float64
OrderQuantity                                     int64
last_transaction_date                    datetime64[ns]
num_transactions                                  int64
totalorders                                       int64
minimumorderquantity                              int64
maximumorderquantity                              int64
first_transaction_date                   datetime64[ns]
average_time_between_purchase                   float64
averageorders                                     int64
Totalmails-notoponed                              int64
Totalmails-oponed                                 int64
timesclicked-yes                                  int64
timesclicked-no                                   int64
days_since_first_transaction                      int64
email_duration                                    int64
duration_of_customer_relation                     int64
recency                                           int64
days_since_first_transaction_category            object
email_duration_category                          object
Totalorderquantity_in_first_quarter               int64
Totalorderquantity_in_second_quarter              int64
Totalorderquantity_in_Third_quarter               int64
dtype: object
for i in categoricalcolumns:
    test[i]=test[i].astype("category")
test['averageorders'].describe()
count    5955.000000
mean       60.636104
std        35.643257
min         1.000000
25%        41.000000
50%        51.000000
75%        73.000000
max       486.000000
Name: averageorders, dtype: float64
# Identify most profitable customers
profitable_customers = test[test['averageorders'] > 100].reset_index()
# Optimize inventory management
large_order_customers = test[test['averageorders'] > 50]
small_order_customers = test[test['averageorders'] <= 50]
# Improve customer experience based on order quantity
test['customer_segment'] = test.apply(lambda row: 'large_order_customer' if row['OrderQuantity'] > 50 else 'small_order_customer', axis=1)
test['customer_segment']=test['customer_segment'].astype("category")
test["Time_to_First_Email"]=(test["first_transaction_date"]-test["firstemailDatetime"]).dt.days
test["Time_to_First_Email"]=test["Time_to_First_Email"].astype("int")
test["Time_to_First_Email"] = np.where(test["Time_to_First_Email"] < 0, 0, test["Time_to_First_Email"])
test["Time_since_Last_Email"]=(test["last_transaction_date"]-test["lastemailDatetime"]).dt.days
test["Time_since_Last_Email"]=test["Time_since_Last_Email"].astype("int")
# Loop over the rows of the dataframe and create a new column 'most_ordered_quarter'
for index, row in test.iterrows():
if row['Totalorderquantity_in_first_quarter'] > row['Totalorderquantity_in_second_quarter'] and row['Totalorderquantity_in_first_quarter'] > row['Totalorderquantity_in_Third_quarter']:
test.at[index, 'most_ordered_quarter'] = 'Q1'
elif row['Totalorderquantity_in_second_quarter'] > row['Totalorderquantity_in_first_quarter'] and row['Totalorderquantity_in_second_quarter'] > row['Totalorderquantity_in_Third_quarter']:
test.at[index, 'most_ordered_quarter'] = 'Q2'
else:
test.at[index, 'most_ordered_quarter'] = 'Q3'
test['most_ordered_quarter']=test['most_ordered_quarter'].astype("category")
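The `iterrows` loop above works but is slow on larger frames; `idxmax` over the three quarter columns does the same in one vectorized call. Note a small behavioral difference: `idxmax` picks the first maximal column on ties, while the loop above falls through to Q3. A minimal sketch on toy data:

```python
import pandas as pd

# Sketch: vectorized alternative to the iterrows loop. Ties resolve to the
# FIRST maximal column here (the loop above resolves ties to Q3). Toy data.
df = pd.DataFrame({
    "Totalorderquantity_in_first_quarter": [10, 0, 5],
    "Totalorderquantity_in_second_quarter": [3, 8, 5],
    "Totalorderquantity_in_Third_quarter": [1, 2, 9],
})
quarter_cols = {
    "Totalorderquantity_in_first_quarter": "Q1",
    "Totalorderquantity_in_second_quarter": "Q2",
    "Totalorderquantity_in_Third_quarter": "Q3",
}
df["most_ordered_quarter"] = (
    df[list(quarter_cols)].idxmax(axis=1).map(quarter_cols).astype("category")
)
```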
test['City']=test['City'].astype("category")
test['EmailType']=test['EmailType'].astype("category")
test['MailOpened']=test['MailOpened'].astype("category")
test['MailClicked']=test['MailClicked'].astype("category")
test
| CustomerID | City | DateOfRegistration | OnlineCommunication | AutomaticRefill | DoorstepDelivery | PreferredDeliveryDay | EmailType | MailOpened | MailClicked | emailDatetime | numofemails | lastemailDatetime | firstemailDatetime | average_time_between_mails | OrderQuantity | last_transaction_date | num_transactions | totalorders | minimumorderquantity | maximumorderquantity | first_transaction_date | average_time_between_purchase | averageorders | Totalmails-notoponed | Totalmails-oponed | timesclicked-yes | timesclicked-no | days_since_first_transaction | email_duration | duration_of_customer_relation | recency | days_since_first_transaction_category | email_duration_category | Totalorderquantity_in_first_quarter | Totalorderquantity_in_second_quarter | Totalorderquantity_in_Third_quarter | customer_segment | Time_to_First_Email | Time_since_Last_Email | most_ordered_quarter | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C17838 | CITY4 | 2010-10-22 | no | no | no | Thursday | WelcomeEmail | yes | NotApplicable | 2010-10-22 16:59:02 | 1 | 2010-10-22 16:59:02 | 2010-10-22 16:59:02 | 0.000000 | 55 | 2011-03-28 17:41:08 | 1 | 55 | 55 | 55 | 2011-03-28 17:41:08 | 0.000000 | 55 | 0 | 1 | 0 | 0 | 0 | 0 | 157 | 2490 | Only transaction | Only email | 55 | 0 | 0 | large_order_customer | 157 | 157 | Q1 |
| 1 | C12403 | CITY2 | 2008-11-17 | yes | yes | no | Wednesday | CustomizedEmail | yes | no | 2013-12-29 13:32:42 | 46 | 2013-12-29 13:32:42 | 2008-11-25 15:40:12 | 40.434783 | 182 | 2014-01-14 10:57:06 | 168 | 29537 | 150 | 206 | 2010-10-12 15:58:02 | 7.083333 | 175 | 9 | 38 | 39 | 7 | 1189 | 1859 | 1884 | 1468 | Long-term History | very Long mail History | 7507 | 11045 | 10985 | large_order_customer | 686 | 15 | Q2 |
| 2 | C19041 | CITY4 | 2011-01-24 | no | yes | no | Friday | CustomizedEmail | no | no | 2013-12-31 13:02:12 | 60 | 2013-12-31 13:02:12 | 2011-02-07 11:00:44 | 17.633333 | 107 | 2014-01-16 17:46:14 | 122 | 14321 | 93 | 142 | 2011-05-16 14:36:36 | 8.000000 | 117 | 34 | 27 | 56 | 4 | 976 | 1058 | 1088 | 1465 | Long-term History | very Long mail History | 3812 | 4783 | 5726 | large_order_customer | 98 | 16 | Q3 |
| 3 | C24069 | CITY2 | 2010-09-18 | no | no | no | Wednesday | CustomizedEmail | no | no | 2010-11-08 17:25:24 | 45 | 2010-11-08 17:25:24 | 2010-09-19 10:07:08 | 1.111111 | 47 | 2010-11-09 18:31:18 | 1 | 47 | 47 | 47 | 2010-11-09 18:31:18 | 0.000000 | 47 | 45 | 1 | 45 | 0 | 0 | 50 | 52 | 2629 | Only transaction | long mail history | 0 | 0 | 47 | small_order_customer | 51 | 1 | Q3 |
| 4 | C26718 | CITY2 | 2010-10-02 | yes | no | no | Monday | CustomizedEmail | yes | no | 2014-01-12 11:33:42 | 13 | 2014-01-12 11:33:42 | 2011-03-26 10:20:28 | 78.692308 | 159 | 2014-01-12 13:04:04 | 129 | 20238 | 125 | 185 | 2011-01-25 17:08:08 | 8.395349 | 156 | 5 | 9 | 11 | 2 | 1082 | 1023 | 1198 | 1470 | Long-term History | very Long mail History | 6096 | 6616 | 7526 | large_order_customer | 0 | 0 | Q3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5950 | C32947 | CITY2 | 2018-01-17 | yes | no | no | Friday | WelcomeEmail | yes | NotApplicable | 2018-01-17 13:05:56 | 1 | 2018-01-17 13:05:56 | 2018-01-17 13:05:56 | 0.000000 | 41 | 2018-01-17 21:35:26 | 1 | 41 | 41 | 41 | 2018-01-17 21:35:26 | 0.000000 | 41 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | Only transaction | Only email | 41 | 0 | 0 | small_order_customer | 0 | 0 | Q1 |
| 5951 | C28951 | CITY3 | 2018-01-17 | yes | no | no | Friday | WelcomeEmail | yes | NotApplicable | 2018-01-17 15:45:18 | 1 | 2018-01-17 15:45:18 | 2018-01-17 15:45:18 | 0.000000 | 40 | 2018-01-17 17:05:50 | 1 | 40 | 40 | 40 | 2018-01-17 17:05:50 | 0.000000 | 40 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | Only transaction | Only email | 40 | 0 | 0 | small_order_customer | 0 | 0 | Q1 |
| 5952 | C38741 | CITY3 | 2018-01-17 | yes | no | no | Monday | WelcomeEmail | yes | NotApplicable | 2018-01-17 11:51:28 | 1 | 2018-01-17 11:51:28 | 2018-01-17 11:51:28 | 0.000000 | 36 | 2018-01-17 18:56:16 | 1 | 36 | 36 | 36 | 2018-01-17 18:56:16 | 0.000000 | 36 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | Only transaction | Only email | 36 | 0 | 0 | small_order_customer | 0 | 0 | Q1 |
| 5953 | C20407 | CITY2 | 2018-01-17 | yes | no | no | Friday | WelcomeEmail | yes | NotApplicable | 2018-01-17 10:25:30 | 1 | 2018-01-17 10:25:30 | 2018-01-17 10:25:30 | 0.000000 | 75 | 2018-01-17 10:24:52 | 1 | 75 | 75 | 75 | 2018-01-17 10:24:52 | 0.000000 | 75 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 4 | Only transaction | Only email | 75 | 0 | 0 | large_order_customer | 0 | -1 | Q1 |
| 5954 | C27220 | CITY3 | 2018-01-17 | yes | no | no | Friday | WelcomeEmail | yes | NotApplicable | 2018-01-17 14:36:32 | 1 | 2018-01-17 14:36:32 | 2018-01-17 14:36:32 | 0.000000 | 40 | 2018-01-17 12:04:10 | 1 | 40 | 40 | 40 | 2018-01-17 12:04:10 | 0.000000 | 40 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 4 | Only transaction | Only email | 40 | 0 | 0 | small_order_customer | 0 | -1 | Q1 |
5955 rows × 41 columns
test['Registration_month']=test['DateOfRegistration'].dt.month
test['RegistrationYear']=test['DateOfRegistration'].dt.year
test['Registration_weekday']=test['DateOfRegistration'].dt.weekday
test['lastemail_month']=test['lastemailDatetime'].dt.month
test['lastemail_weekday']=test['lastemailDatetime'].dt.weekday
test['transaction_month']=test['last_transaction_date'].dt.month
test['transaction_weekday']=test['last_transaction_date'].dt.weekday
test=test.drop(["last_transaction_date","lastemailDatetime","DateOfRegistration"],axis=1)
test=test.drop(['firstemailDatetime',"first_transaction_date","emailDatetime"],axis=1)
for col in ['Registration_month','RegistrationYear',
'Registration_weekday', 'lastemail_month', 'lastemail_weekday',
'transaction_month', 'transaction_weekday']:
test[col] =test[col].astype('category')
#feature elimination for numerical columns
test=test.drop(['Totalorderquantity_in_first_quarter',
'Totalorderquantity_in_second_quarter',
'Totalorderquantity_in_Third_quarter','minimumorderquantity', 'maximumorderquantity','averageorders'],axis=1)
test['open_rate'] = np.log(test['Totalmails-oponed'] / test['numofemails'])
test['click_through_rate'] = np.log(test['totalorders'] / test['Totalmails-oponed'])
test['response_rate'] = np.log(test['totalorders'] / test['numofemails'])
test['open_rate_ratio'] =np.log(test['Totalmails-oponed'] / test['Totalmails-notoponed'])
test['click_through_rate_ratio'] = np.log(test['timesclicked-yes'] / test['timesclicked-no'])
test['response_rate_ratio'] = np.log(test['totalorders']) / (test['numofemails'] - test['totalorders'])
# Replace inf and NaN values in selected columns
cols = ['open_rate_ratio', 'click_through_rate_ratio', 'response_rate_ratio','click_through_rate','open_rate','response_rate']
replace_value = 0 # Replace with NaN, or any other value you want
test[cols] = test[cols].replace([np.inf, -np.inf], 0)
test[cols] = test[cols].fillna(replace_value)
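The `np.log` calls above produce `-inf` for zero counts and `NaN` for 0/0, which the replace/fillna step then patches. A hedged alternative (the helper name `safe_log_ratio` is mine) is to mask out non-positive inputs at computation time so no inf/NaN is created:

```python
import numpy as np
import pandas as pd

# Sketch: compute log(num/den) only where both counts are positive; fall back
# to 0.0 elsewhere, matching the replace-with-0 behavior above. Toy counts.
df = pd.DataFrame({"Totalmails-oponed": [38, 0], "numofemails": [46, 1]})

def safe_log_ratio(num, den):
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    out = np.zeros(len(num))
    mask = (num > 0) & (den > 0)
    out[mask] = np.log(num[mask] / den[mask])
    return out

df["open_rate"] = safe_log_ratio(df["Totalmails-oponed"], df["numofemails"])
```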
test1=test.drop(['Totalmails-notoponed','Totalmails-oponed', 'timesclicked-yes', 'timesclicked-no'],axis=1)
test1 = test1.rename(columns={'totalorders': 'sumoforderquantity'})
len(test1.columns)
38
len(train.columns)#train additionally contains the RetentionPotential target column
39
train_columns = set(train.columns)
test_columns = set(test1.columns)
missing_columns = train_columns - test_columns
print("Columns missing in test data:", missing_columns)
Columns missing in test data: {'RetentionPotential'}
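Once the set difference confirms only the target is missing, `reindex` can align the test frame to the training feature order in one step, filling any dummy columns the test set never produced with 0. A minimal sketch (`train_feature_cols` is a stand-in for the training column list minus the target):

```python
import pandas as pd

# Sketch: align test columns to the training feature order; columns absent in
# test (e.g. a dummy level never seen there) are filled with 0. Toy data.
train_feature_cols = ["f1", "f2", "f3"]
test_df = pd.DataFrame({"f3": [1], "f1": [2]})  # f2 missing, order scrambled
aligned = test_df.reindex(columns=train_feature_cols, fill_value=0)
```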
# split into input (X) and output (y) variables
test1=test1.drop(['CustomerID','days_since_first_transaction_category', 'email_duration_category'], axis=1)
cat_cols
Index(['City', 'OnlineCommunication', 'AutomaticRefill', 'DoorstepDelivery',
'PreferredDeliveryDay', 'EmailType', 'MailOpened', 'MailClicked',
'RegistrationYear', 'customer_segment', 'most_ordered_quarter',
'Registration_month', 'Registration_weekday', 'lastemail_month',
'lastemail_weekday', 'transaction_month', 'transaction_weekday'],
dtype='object')
cat_cols = test1.select_dtypes(include=['category']).columns
num_cols = test1.select_dtypes(include=['int64', 'float64'])
num_cols = num_cols.drop(['open_rate', 'click_through_rate', 'response_rate',
                          'open_rate_ratio', 'click_through_rate_ratio',
                          'response_rate_ratio'], axis=1)
num_cols = num_cols.columns
## Convert Categorical Columns to Dummies
test1 = pd.get_dummies(test1, columns=cat_cols, drop_first=True)
# scaling: reuse the StandardScaler already fitted on the training features,
# transforming the test set without refitting (fitting a new scaler on test
# would standardise by the test set's own statistics, not the training ones)
test1[num_cols] = scaler.transform(test1[num_cols])
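Scaling for inference should reuse the statistics learned on the training set. A minimal sketch of the fit-on-train / transform-on-test pattern, with illustrative arrays standing in for the notebook's numeric columns:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative numeric features (the notebook uses train/test1[num_cols])
X_train_num = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_test_num = np.array([[3.0, 30.0]])

scaler = StandardScaler().fit(X_train_num)    # fit on training data only
X_test_scaled = scaler.transform(X_test_num)  # apply train mean/std to test
```

Because the test row equals the training mean here, it scales to zero in both columns.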
len(test1.columns)
97
len(X.columns)
97
test1
|   | numofemails | average_time_between_mails | OrderQuantity | num_transactions | sumoforderquantity | average_time_between_purchase | days_since_first_transaction | email_duration | duration_of_customer_relation | recency | Time_to_First_Email | Time_since_Last_Email | open_rate | click_through_rate | response_rate | open_rate_ratio | click_through_rate_ratio | response_rate_ratio | City_CITY2 | City_CITY3 | City_CITY4 | OnlineCommunication_yes | AutomaticRefill_yes | DoorstepDelivery_yes | PreferredDeliveryDay_Monday | PreferredDeliveryDay_Thursday | PreferredDeliveryDay_Tuesday | PreferredDeliveryDay_Wednesday | PreferredDeliveryDay_weekend | EmailType_WelcomeEmail | MailOpened_yes | MailClicked_no | MailClicked_yes | customer_segment_small_order_customer | most_ordered_quarter_Q2 | most_ordered_quarter_Q3 | Registration_month_2 | Registration_month_3 | Registration_month_4 | Registration_month_5 | Registration_month_6 | Registration_month_7 | Registration_month_8 | Registration_month_9 | Registration_month_10 | Registration_month_11 | Registration_month_12 | RegistrationYear_2009 | RegistrationYear_2010 | RegistrationYear_2011 | RegistrationYear_2012 | RegistrationYear_2013 | RegistrationYear_2014 | RegistrationYear_2015 | RegistrationYear_2016 | RegistrationYear_2017 | RegistrationYear_2018 | Registration_weekday_1 | Registration_weekday_2 | Registration_weekday_3 | Registration_weekday_4 | Registration_weekday_5 | Registration_weekday_6 | lastemail_month_2 | lastemail_month_3 | lastemail_month_4 | lastemail_month_5 | lastemail_month_6 | lastemail_month_7 | lastemail_month_8 | lastemail_month_9 | lastemail_month_10 | lastemail_month_11 | lastemail_month_12 | lastemail_weekday_1 | lastemail_weekday_2 | lastemail_weekday_3 | lastemail_weekday_4 | lastemail_weekday_5 | lastemail_weekday_6 | transaction_month_2 | transaction_month_3 | transaction_month_4 | transaction_month_5 | transaction_month_6 | transaction_month_7 | transaction_month_8 | transaction_month_9 | transaction_month_10 | transaction_month_11 | transaction_month_12 | transaction_weekday_1 | transaction_weekday_2 | transaction_weekday_3 | transaction_weekday_4 | transaction_weekday_5 | transaction_weekday_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.002649 | -0.304959 | -0.159308 | -0.273265 | -0.194793 | -0.348208 | -0.388968 | -0.426962 | 0.107045 | 1.566055 | 0.955169 | 1.438968 | 0.000000 | 4.007333 | 4.007333 | 0.000000 | 0.000000 | -0.074210 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1.243485 | 3.622659 | 3.375352 | 12.783856 | 17.507358 | -0.142518 | 4.985016 | 7.179561 | 6.510322 | 0.062220 | 5.241099 | -0.041607 | -0.191055 | 6.655813 | 6.464758 | 1.440362 | 1.717651 | -0.000349 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1.942282 | 1.407849 | 1.287954 | 9.187284 | 8.371074 | -0.115900 | 4.022309 | 3.902087 | 3.558956 | 0.057805 | 0.477154 | -0.031180 | -0.798508 | 6.273645 | 5.475138 | -0.230524 | 2.639057 | -0.000671 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1.193571 | -0.197032 | -0.381964 | -0.273265 | -0.199596 | -0.348208 | -0.388968 | -0.222376 | -0.282269 | 1.770588 | 0.096362 | -0.187579 | -3.806662 | 3.850148 | 0.043485 | -3.806662 | 0.000000 | -1.925074 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | -0.403680 | 7.338791 | 2.735216 | 9.734588 | 11.923873 | -0.104419 | 4.501402 | 3.758876 | 3.966808 | 0.065162 | -0.316837 | -0.198005 | -0.367725 | 7.718093 | 7.350368 | 0.587787 | 1.704748 | -0.000490 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5950 | -1.002649 | -0.304959 | -0.548956 | -0.273265 | -0.203199 | -0.348208 | -0.388968 | -0.426962 | -0.475072 | -2.093474 | -0.316837 | -0.198005 | 0.000000 | 3.713572 | 3.713572 | 0.000000 | 0.000000 | -0.092839 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5951 | -1.002649 | -0.304959 | -0.576788 | -0.273265 | -0.203799 | -0.348208 | -0.388968 | -0.426962 | -0.475072 | -2.093474 | -0.316837 | -0.198005 | 0.000000 | 3.688879 | 3.688879 | 0.000000 | 0.000000 | -0.094587 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5952 | -1.002649 | -0.304959 | -0.688116 | -0.273265 | -0.206201 | -0.348208 | -0.388968 | -0.426962 | -0.475072 | -2.093474 | -0.316837 | -0.198005 | 0.000000 | 3.583519 | 3.583519 | 0.000000 | 0.000000 | -0.102386 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5953 | -1.002649 | -0.304959 | 0.397331 | -0.273265 | -0.182784 | -0.348208 | -0.388968 | -0.426962 | -0.475072 | -2.092002 | -0.316837 | -0.208432 | 0.000000 | 4.317488 | 4.317488 | 0.000000 | 0.000000 | -0.058344 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5954 | -1.002649 | -0.304959 | -0.576788 | -0.273265 | -0.203799 | -0.348208 | -0.388968 | -0.426962 | -0.475072 | -2.092002 | -0.316837 | -0.208432 | 0.000000 | 3.688879 | 3.688879 | 0.000000 | 0.000000 | -0.094587 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
5955 rows × 97 columns
test1_columns = set(test1.columns)
X_columns = set(X.columns)
extra_columns = test1_columns - X_columns
print("Columns in test data but not in X:", extra_columns)
Columns in test data but not in X: set()
test1.columns
Index(['numofemails', 'average_time_between_mails', 'OrderQuantity',
'num_transactions', 'sumoforderquantity',
'average_time_between_purchase', 'days_since_first_transaction',
'email_duration', 'duration_of_customer_relation', 'recency',
'Time_to_First_Email', 'Time_since_Last_Email', 'open_rate',
'click_through_rate', 'response_rate', 'open_rate_ratio',
'click_through_rate_ratio', 'response_rate_ratio', 'City_CITY2',
'City_CITY3', 'City_CITY4', 'OnlineCommunication_yes',
'AutomaticRefill_yes', 'DoorstepDelivery_yes',
'PreferredDeliveryDay_Monday', 'PreferredDeliveryDay_Thursday',
'PreferredDeliveryDay_Tuesday', 'PreferredDeliveryDay_Wednesday',
'PreferredDeliveryDay_weekend', 'EmailType_WelcomeEmail',
'MailOpened_yes', 'MailClicked_no', 'MailClicked_yes',
'customer_segment_small_order_customer', 'most_ordered_quarter_Q2',
'most_ordered_quarter_Q3', 'Registration_month_2',
'Registration_month_3', 'Registration_month_4', 'Registration_month_5',
'Registration_month_6', 'Registration_month_7', 'Registration_month_8',
'Registration_month_9', 'Registration_month_10',
'Registration_month_11', 'Registration_month_12',
'RegistrationYear_2009', 'RegistrationYear_2010',
'RegistrationYear_2011', 'RegistrationYear_2012',
'RegistrationYear_2013', 'RegistrationYear_2014',
'RegistrationYear_2015', 'RegistrationYear_2016',
'RegistrationYear_2017', 'RegistrationYear_2018',
'Registration_weekday_1', 'Registration_weekday_2',
'Registration_weekday_3', 'Registration_weekday_4',
'Registration_weekday_5', 'Registration_weekday_6', 'lastemail_month_2',
'lastemail_month_3', 'lastemail_month_4', 'lastemail_month_5',
'lastemail_month_6', 'lastemail_month_7', 'lastemail_month_8',
'lastemail_month_9', 'lastemail_month_10', 'lastemail_month_11',
'lastemail_month_12', 'lastemail_weekday_1', 'lastemail_weekday_2',
'lastemail_weekday_3', 'lastemail_weekday_4', 'lastemail_weekday_5',
'lastemail_weekday_6', 'transaction_month_2', 'transaction_month_3',
'transaction_month_4', 'transaction_month_5', 'transaction_month_6',
'transaction_month_7', 'transaction_month_8', 'transaction_month_9',
'transaction_month_10', 'transaction_month_11', 'transaction_month_12',
'transaction_weekday_1', 'transaction_weekday_2',
'transaction_weekday_3', 'transaction_weekday_4',
'transaction_weekday_5', 'transaction_weekday_6'],
dtype='object')
X.columns
Index(['numofemails', 'average_time_between_mails', 'OrderQuantity',
'num_transactions', 'sumoforderquantity',
'average_time_between_purchase', 'days_since_first_transaction',
'email_duration', 'duration_of_customer_relation', 'recency',
'Time_to_First_Email', 'Time_since_Last_Email', 'open_rate',
'click_through_rate', 'response_rate', 'open_rate_ratio',
'click_through_rate_ratio', 'response_rate_ratio', 'City_CITY2',
'City_CITY3', 'City_CITY4', 'OnlineCommunication_yes',
'AutomaticRefill_yes', 'DoorstepDelivery_yes',
'PreferredDeliveryDay_Monday', 'PreferredDeliveryDay_Thursday',
'PreferredDeliveryDay_Tuesday', 'PreferredDeliveryDay_Wednesday',
'PreferredDeliveryDay_weekend', 'EmailType_WelcomeEmail',
'MailOpened_yes', 'MailClicked_no', 'MailClicked_yes',
'RegistrationYear_2009', 'RegistrationYear_2010',
'RegistrationYear_2011', 'RegistrationYear_2012',
'RegistrationYear_2013', 'RegistrationYear_2014',
'RegistrationYear_2015', 'RegistrationYear_2016',
'RegistrationYear_2017', 'RegistrationYear_2018',
'customer_segment_small_order_customer', 'most_ordered_quarter_Q2',
'most_ordered_quarter_Q3', 'Registration_month_2',
'Registration_month_3', 'Registration_month_4', 'Registration_month_5',
'Registration_month_6', 'Registration_month_7', 'Registration_month_8',
'Registration_month_9', 'Registration_month_10',
'Registration_month_11', 'Registration_month_12',
'Registration_weekday_1', 'Registration_weekday_2',
'Registration_weekday_3', 'Registration_weekday_4',
'Registration_weekday_5', 'Registration_weekday_6', 'lastemail_month_2',
'lastemail_month_3', 'lastemail_month_4', 'lastemail_month_5',
'lastemail_month_6', 'lastemail_month_7', 'lastemail_month_8',
'lastemail_month_9', 'lastemail_month_10', 'lastemail_month_11',
'lastemail_month_12', 'lastemail_weekday_1', 'lastemail_weekday_2',
'lastemail_weekday_3', 'lastemail_weekday_4', 'lastemail_weekday_5',
'lastemail_weekday_6', 'transaction_month_2', 'transaction_month_3',
'transaction_month_4', 'transaction_month_5', 'transaction_month_6',
'transaction_month_7', 'transaction_month_8', 'transaction_month_9',
'transaction_month_10', 'transaction_month_11', 'transaction_month_12',
'transaction_weekday_1', 'transaction_weekday_2',
'transaction_weekday_3', 'transaction_weekday_4',
'transaction_weekday_5', 'transaction_weekday_6'],
dtype='object')
# create a list of feature names from the column names of X_train
feature_names = list(X_train.columns)
# check the feature names used during training
print("Feature names used during training:", feature_names)
# check the feature names in the test data
print("Feature names in test data:", list(test1.columns))
# reorder the columns of the test data to match the order of the feature names in feature_names
test1 = test1[feature_names]
# check the reordered feature names in the test data
print("Feature names in test data after reordering:", list(test1.columns))
Feature names used during training: ['numofemails', 'average_time_between_mails', 'OrderQuantity', 'num_transactions', 'sumoforderquantity', 'average_time_between_purchase', 'days_since_first_transaction', 'email_duration', 'duration_of_customer_relation', 'recency', 'Time_to_First_Email', 'Time_since_Last_Email', 'open_rate', 'click_through_rate', 'response_rate', 'open_rate_ratio', 'click_through_rate_ratio', 'response_rate_ratio', 'City_CITY2', 'City_CITY3', 'City_CITY4', 'OnlineCommunication_yes', 'AutomaticRefill_yes', 'DoorstepDelivery_yes', 'PreferredDeliveryDay_Monday', 'PreferredDeliveryDay_Thursday', 'PreferredDeliveryDay_Tuesday', 'PreferredDeliveryDay_Wednesday', 'PreferredDeliveryDay_weekend', 'EmailType_WelcomeEmail', 'MailOpened_yes', 'MailClicked_no', 'MailClicked_yes', 'RegistrationYear_2009', 'RegistrationYear_2010', 'RegistrationYear_2011', 'RegistrationYear_2012', 'RegistrationYear_2013', 'RegistrationYear_2014', 'RegistrationYear_2015', 'RegistrationYear_2016', 'RegistrationYear_2017', 'RegistrationYear_2018', 'customer_segment_small_order_customer', 'most_ordered_quarter_Q2', 'most_ordered_quarter_Q3', 'Registration_month_2', 'Registration_month_3', 'Registration_month_4', 'Registration_month_5', 'Registration_month_6', 'Registration_month_7', 'Registration_month_8', 'Registration_month_9', 'Registration_month_10', 'Registration_month_11', 'Registration_month_12', 'Registration_weekday_1', 'Registration_weekday_2', 'Registration_weekday_3', 'Registration_weekday_4', 'Registration_weekday_5', 'Registration_weekday_6', 'lastemail_month_2', 'lastemail_month_3', 'lastemail_month_4', 'lastemail_month_5', 'lastemail_month_6', 'lastemail_month_7', 'lastemail_month_8', 'lastemail_month_9', 'lastemail_month_10', 'lastemail_month_11', 'lastemail_month_12', 'lastemail_weekday_1', 'lastemail_weekday_2', 'lastemail_weekday_3', 'lastemail_weekday_4', 'lastemail_weekday_5', 'lastemail_weekday_6', 'transaction_month_2', 'transaction_month_3', 
'transaction_month_4', 'transaction_month_5', 'transaction_month_6', 'transaction_month_7', 'transaction_month_8', 'transaction_month_9', 'transaction_month_10', 'transaction_month_11', 'transaction_month_12', 'transaction_weekday_1', 'transaction_weekday_2', 'transaction_weekday_3', 'transaction_weekday_4', 'transaction_weekday_5', 'transaction_weekday_6'] Feature names in test data: ['numofemails', 'average_time_between_mails', 'OrderQuantity', 'num_transactions', 'sumoforderquantity', 'average_time_between_purchase', 'days_since_first_transaction', 'email_duration', 'duration_of_customer_relation', 'recency', 'Time_to_First_Email', 'Time_since_Last_Email', 'open_rate', 'click_through_rate', 'response_rate', 'open_rate_ratio', 'click_through_rate_ratio', 'response_rate_ratio', 'City_CITY2', 'City_CITY3', 'City_CITY4', 'OnlineCommunication_yes', 'AutomaticRefill_yes', 'DoorstepDelivery_yes', 'PreferredDeliveryDay_Monday', 'PreferredDeliveryDay_Thursday', 'PreferredDeliveryDay_Tuesday', 'PreferredDeliveryDay_Wednesday', 'PreferredDeliveryDay_weekend', 'EmailType_WelcomeEmail', 'MailOpened_yes', 'MailClicked_no', 'MailClicked_yes', 'customer_segment_small_order_customer', 'most_ordered_quarter_Q2', 'most_ordered_quarter_Q3', 'Registration_month_2', 'Registration_month_3', 'Registration_month_4', 'Registration_month_5', 'Registration_month_6', 'Registration_month_7', 'Registration_month_8', 'Registration_month_9', 'Registration_month_10', 'Registration_month_11', 'Registration_month_12', 'RegistrationYear_2009', 'RegistrationYear_2010', 'RegistrationYear_2011', 'RegistrationYear_2012', 'RegistrationYear_2013', 'RegistrationYear_2014', 'RegistrationYear_2015', 'RegistrationYear_2016', 'RegistrationYear_2017', 'RegistrationYear_2018', 'Registration_weekday_1', 'Registration_weekday_2', 'Registration_weekday_3', 'Registration_weekday_4', 'Registration_weekday_5', 'Registration_weekday_6', 'lastemail_month_2', 'lastemail_month_3', 'lastemail_month_4', 
'lastemail_month_5', 'lastemail_month_6', 'lastemail_month_7', 'lastemail_month_8', 'lastemail_month_9', 'lastemail_month_10', 'lastemail_month_11', 'lastemail_month_12', 'lastemail_weekday_1', 'lastemail_weekday_2', 'lastemail_weekday_3', 'lastemail_weekday_4', 'lastemail_weekday_5', 'lastemail_weekday_6', 'transaction_month_2', 'transaction_month_3', 'transaction_month_4', 'transaction_month_5', 'transaction_month_6', 'transaction_month_7', 'transaction_month_8', 'transaction_month_9', 'transaction_month_10', 'transaction_month_11', 'transaction_month_12', 'transaction_weekday_1', 'transaction_weekday_2', 'transaction_weekday_3', 'transaction_weekday_4', 'transaction_weekday_5', 'transaction_weekday_6'] Feature names in test data after reordering: ['numofemails', 'average_time_between_mails', 'OrderQuantity', 'num_transactions', 'sumoforderquantity', 'average_time_between_purchase', 'days_since_first_transaction', 'email_duration', 'duration_of_customer_relation', 'recency', 'Time_to_First_Email', 'Time_since_Last_Email', 'open_rate', 'click_through_rate', 'response_rate', 'open_rate_ratio', 'click_through_rate_ratio', 'response_rate_ratio', 'City_CITY2', 'City_CITY3', 'City_CITY4', 'OnlineCommunication_yes', 'AutomaticRefill_yes', 'DoorstepDelivery_yes', 'PreferredDeliveryDay_Monday', 'PreferredDeliveryDay_Thursday', 'PreferredDeliveryDay_Tuesday', 'PreferredDeliveryDay_Wednesday', 'PreferredDeliveryDay_weekend', 'EmailType_WelcomeEmail', 'MailOpened_yes', 'MailClicked_no', 'MailClicked_yes', 'RegistrationYear_2009', 'RegistrationYear_2010', 'RegistrationYear_2011', 'RegistrationYear_2012', 'RegistrationYear_2013', 'RegistrationYear_2014', 'RegistrationYear_2015', 'RegistrationYear_2016', 'RegistrationYear_2017', 'RegistrationYear_2018', 'customer_segment_small_order_customer', 'most_ordered_quarter_Q2', 'most_ordered_quarter_Q3', 'Registration_month_2', 'Registration_month_3', 'Registration_month_4', 'Registration_month_5', 'Registration_month_6', 
'Registration_month_7', 'Registration_month_8', 'Registration_month_9', 'Registration_month_10', 'Registration_month_11', 'Registration_month_12', 'Registration_weekday_1', 'Registration_weekday_2', 'Registration_weekday_3', 'Registration_weekday_4', 'Registration_weekday_5', 'Registration_weekday_6', 'lastemail_month_2', 'lastemail_month_3', 'lastemail_month_4', 'lastemail_month_5', 'lastemail_month_6', 'lastemail_month_7', 'lastemail_month_8', 'lastemail_month_9', 'lastemail_month_10', 'lastemail_month_11', 'lastemail_month_12', 'lastemail_weekday_1', 'lastemail_weekday_2', 'lastemail_weekday_3', 'lastemail_weekday_4', 'lastemail_weekday_5', 'lastemail_weekday_6', 'transaction_month_2', 'transaction_month_3', 'transaction_month_4', 'transaction_month_5', 'transaction_month_6', 'transaction_month_7', 'transaction_month_8', 'transaction_month_9', 'transaction_month_10', 'transaction_month_11', 'transaction_month_12', 'transaction_weekday_1', 'transaction_weekday_2', 'transaction_weekday_3', 'transaction_weekday_4', 'transaction_weekday_5', 'transaction_weekday_6']
# make predictions on the test data
y_pred = clf_dt_grid.predict(test1)
print("Predictions:", y_pred)
Predictions: [0 1 1 ... 0 0 0]
csv=pd.read_excel("/content/Test.xlsx")
train.columns
Index(['CustomerID', 'City', 'OnlineCommunication', 'AutomaticRefill',
'DoorstepDelivery', 'PreferredDeliveryDay', 'RetentionPotential',
'EmailType', 'MailOpened', 'MailClicked', 'numofemails',
'average_time_between_mails', 'OrderQuantity', 'num_transactions',
'sumoforderquantity', 'average_time_between_purchase',
'days_since_first_transaction', 'email_duration',
'duration_of_customer_relation', 'recency',
'days_since_first_transaction_category', 'email_duration_category',
'RegistrationYear', 'customer_segment', 'Time_to_First_Email',
'Time_since_Last_Email', 'most_ordered_quarter', 'Registration_month',
'Registration_weekday', 'lastemail_month', 'lastemail_weekday',
'transaction_month', 'transaction_weekday', 'open_rate',
'click_through_rate', 'response_rate', 'open_rate_ratio',
'click_through_rate_ratio', 'response_rate_ratio'],
dtype='object')
csv["RetentionPotential"]=y_pred
csv["RetentionPotential"].value_counts()
1    3093
0    2414
2     448
Name: RetentionPotential, dtype: int64
# RFM segmentation
# let's segment our customers based on their transaction records
# the most recent transaction in the data is on 2018-01-21
transaction["DateOfOrder"].describe()
count                   26768
unique                   2771
top       2014-01-13 00:00:00
freq                       82
first     2008-08-05 00:00:00
last      2018-01-21 00:00:00
Name: DateOfOrder, dtype: object
# compute the days elapsed since each transaction, relative to the last observed order date
transaction['days_since'] = pd.to_datetime(transaction['DateOfOrder']).apply(lambda x: (pd.to_datetime("2018-01-21") - x).days)
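The `apply` with a lambda works; subtracting the datetime column directly and taking `.dt.days` is an equivalent vectorised form, usually faster on large frames. A sketch on a toy frame, with the reference date taken from the `describe` output above:

```python
import pandas as pd

orders = pd.DataFrame({"DateOfOrder": ["2018-01-11", "2017-12-22"]})
ref_date = pd.Timestamp("2018-01-21")  # last order date in the data
# vectorised day difference: no per-row Python lambda needed
orders["days_since"] = (ref_date - pd.to_datetime(orders["DateOfOrder"])).dt.days
```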
## Compute recency, frequency, and monetary_value
def rec_freq_value(x):
    recency = x['days_since'].min()          # how recently the customer was active
    first_purchase = x['days_since'].max()   # the largest day gap marks the first purchase
    frequency = x['days_since'].count()      # how many transactions the customer made
    order_value = x['OrderQuantity'].mean()  # average quantity per order
    c = ['recency', 'first_purchase', 'frequency', 'order_value']
    return pd.Series([recency, first_purchase, frequency, order_value], index=c)
#applying the function on my whole records
RFM = transaction.groupby('CustomerID').apply(rec_freq_value)
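The same per-customer summary can be expressed with pandas named aggregation instead of `groupby().apply()`, which is typically faster; a sketch on a toy transaction frame with the same column names:

```python
import pandas as pd

tx = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "days_since": [5, 40, 12],
    "OrderQuantity": [10, 20, 30],
})
# named aggregation: output column -> (input column, aggregation function)
rfm = tx.groupby("CustomerID").agg(
    recency=("days_since", "min"),
    first_purchase=("days_since", "max"),
    frequency=("days_since", "count"),
    order_value=("OrderQuantity", "mean"),
)
```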
#lets combine email behaviour with rfm
emailresponses=test[['CustomerID','Totalmails-notoponed', 'Totalmails-oponed', 'timesclicked-yes', 'timesclicked-no','numofemails']]
# Combine RFM scores and email response data
rfm_email = pd.merge(RFM, emailresponses, on=["CustomerID"])
rfm_email = rfm_email.drop(["CustomerID"], axis=1)
# Normalize the RFM scores and email response data
rfm_email_norm = (rfm_email - rfm_email.mean()) / rfm_email.std()
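One caveat: z-scoring with the current frame's own mean and std only matches what `KMean_clust` saw if these are the same data it was fitted on; for genuinely new customers, the fit-time statistics should be reused. A small sketch of that pattern, with illustrative numbers:

```python
import pandas as pd

# Statistics from the data the clusterer was fitted on (illustrative values)
fit_rfm = pd.DataFrame({"recency": [10.0, 20.0, 30.0]})
mu, sigma = fit_rfm.mean(), fit_rfm.std()

# New customers are normalised with the fit-time mean/std, not their own
new_rfm = pd.DataFrame({"recency": [20.0]})
new_norm = (new_rfm - mu) / sigma
```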
test_labels=KMean_clust.predict(rfm_email_norm)
#Find the clusters for the observation given in the dataset
rfm_email['Cluster'] = test_labels
rfm_email.head()
|   | recency | first_purchase | frequency | order_value | Totalmails-notoponed | Totalmails-oponed | timesclicked-yes | timesclicked-no | numofemails | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38.0 | 38.0 | 1.0 | 52.0 | 16 | 9 | 21 | 3 | 24 | 0 |
| 1 | 1554.0 | 1554.0 | 1.0 | 40.0 | 0 | 1 | 0 | 0 | 1 | 4 |
| 2 | 1586.0 | 1586.0 | 1.0 | 40.0 | 0 | 1 | 0 | 0 | 1 | 4 |
| 3 | 11.0 | 11.0 | 1.0 | 41.0 | 21 | 9 | 24 | 5 | 29 | 0 |
| 4 | 1555.0 | 1555.0 | 1.0 | 79.0 | 38 | 4 | 40 | 1 | 41 | 1 |
from matplotlib import pyplot as plt
##Scatter Plot Frequency Vs Recency
Colors = ["red", "green", "blue","black","yellow"]
rfm_email['Color'] = rfm_email['Cluster'].map(lambda p: Colors[p])
ax = rfm_email.plot(
kind="scatter",
x="recency", y="frequency",
figsize=(10,8),
c = rfm_email['Color']
)
ax = rfm_email.plot(
kind="scatter",
x="order_value", y="frequency",
figsize=(10,8),
c = rfm_email['Color']
)
cluster_names = {
    0: 'midfrequency',
    1: 'Higherrecency-lessfrequent',
    2: 'recenttransactionswithlessfrequency',
    3: 'midfrequencywithrecenttransactionatintervals',
    4: 'higherfrequencyatdifferentintervals',
}
rfm_email['Cluster Name'] = rfm_email['Cluster'].map(cluster_names)
csv = csv.reset_index(drop=True)
csv['Cluster Name'] = rfm_email['Cluster Name']
csv['Cluster Name']=csv['Cluster Name'].astype("category")
csv
|   | CustomerID | RetentionPotential | Cluster Name |
|---|---|---|---|
| 0 | C12121 | 0 | midfrequency |
| 1 | C12132 | 1 | higherfrequencyatdifferentintervals |
| 2 | C12134 | 1 | higherfrequencyatdifferentintervals |
| 3 | C12135 | 1 | midfrequency |
| 4 | C12136 | 2 | Higherrecency-lessfrequent |
| ... | ... | ... | ... |
| 5950 | C42153 | 0 | higherfrequencyatdifferentintervals |
| 5951 | C42155 | 0 | higherfrequencyatdifferentintervals |
| 5952 | C42157 | 0 | higherfrequencyatdifferentintervals |
| 5953 | C42167 | 0 | Higherrecency-lessfrequent |
| 5954 | C42190 | 0 | higherfrequencyatdifferentintervals |
5955 rows × 3 columns
# save dataframe to a csv file
csv.to_csv('my_final_predictions-withclusters.csv', index=False)